Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces

Gen Luo*, Ganlin Yang*, Ziyang Gong*, Guanzhou Chen*, Haonan Duan, Erfei Cui,
Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, Shenglong Ye, Lewei Lu,
Jingbo Wang, Wenhai Wang, Jifeng Dai, Yu Qiao, Rongrong Ji, Xizhou Zhu
Shanghai AI Laboratory, Tsinghua University, University of Science and Technology of China,
Shanghai Jiao Tong University, Xiamen University, SenseTime Research,
Zhejiang University, Nanjing University
* Equal contribution.

Video



Abstract

The remarkable progress of Multimodal Large Language Models (MLLMs) has attracted increasing attention to extending them to physical entities such as legged robots. This typically requires MLLMs not only to master multimodal understanding, but also to integrate visual-spatial reasoning and physical interaction capabilities. Nevertheless, existing methods struggle to unify these capabilities due to their fundamental differences. In this paper, we present the Visual Embodied Brain (VeBrain), a unified framework for perception, reasoning, and control in the real world. VeBrain reformulates robotic control as common text-based MLLM tasks in the 2D visual space, thus unifying the objectives and mapping spaces of different tasks. A novel robotic adapter is then proposed to convert the textual control signals from the MLLM into motion policies for real robots. From the data perspective, we further introduce VeBrain-600k, a high-quality instruction dataset encompassing the various capabilities of VeBrain. To build VeBrain-600k, we spent hundreds of hours collecting, curating, and annotating the data, and adopted multimodal chain-of-thought (CoT) to mix the different capabilities into single conversations. Extensive experiments on 13 multimodal benchmarks and 5 spatial intelligence benchmarks demonstrate the superior performance of VeBrain over existing MLLMs such as Qwen2.5-VL. When deployed on legged robots and robotic arms, VeBrain shows strong adaptability, flexibility, and compositional capability compared to existing methods. For example, compared to Qwen2.5-VL, VeBrain not only achieves a substantial +5.6% gain on MMVet, but also excels in legged robot tasks with an average gain of +50%.

Demonstrations

Manipulation Tasks

Long Horizon (Carrot)
Long Horizon (Pepper)
Move In (Banana)
Move In (Pepper)
Move Out (Carrot)
Open Drawer


Locomotion Tasks

Complex Transport
Complex Interaction (Shake)
Complex Find (Pineapple)
Find (Orange)
Interaction (Touch)
Transport

VeBrain Architecture

VeBrain establishes a closed-loop control system that integrates the MLLM and the robotic adapter. The MLLM is responsible for understanding and thinking. Specifically, the MLLM mainly handles two tasks:

  • (1) Keypoint detection. Based on the visual input, predict the 2D keypoints required to complete the task.
  • (2) Skill recognition. Generate the semantic action to execute upon reaching each keypoint or target (a minimal sketch of this textual interface follows this list).
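
To make this interface concrete, below is a minimal sketch of how such a textual control decision could be parsed on the robot side. The <answer> tag, JSON schema, coordinate values, and function name are illustrative assumptions rather than VeBrain's actual output format.

  import json
  import re

  # Hypothetical textual MLLM response that encodes a control decision as
  # 2D keypoints in the image plane plus a semantic skill label.
  mllm_response = '<answer>{"keypoints": [[412, 287]], "skill": "grasp"}</answer>'

  def parse_control_text(text: str):
      """Extract 2D pixel keypoints and a skill name from the MLLM output."""
      match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
      if match is None:
          raise ValueError("no control answer found; trigger replanning")
      payload = json.loads(match.group(1))
      return payload["keypoints"], payload["skill"]

  keypoints, skill = parse_control_text(mllm_response)
  print(keypoints, skill)  # [[412, 287]] grasp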



The robotic adapter is responsible for converting the MLLM's decisions into executable policies. With a modular and flexible design, it can adapt to different robotic platforms and tasks. It consists of four main components:
  • (1) Point tracker. During the robot's movement, this module continuously updates the keypoints in the robot's current camera view in real time.
  • (2) Movement controller. This module leverages depth information captured by the RGB-D camera to convert the 2D keypoints into 3D control commands (see the sketch after this list).
  • (3) Skill executor. This module executes the collected control policies, such as sitting or grasping.
  • (4) Dynamic takeover. When the target is lost or a policy fails, this module automatically triggers the MLLM to replan.
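
As an illustration of the movement controller described above, the sketch below lifts a tracked 2D keypoint into a 3D camera-frame target with a standard pinhole model. The intrinsics, pixel location, and depth value are assumed for illustration and are not VeBrain's actual parameters.

  import numpy as np

  def keypoint_to_camera_frame(u, v, depth_m, fx, fy, cx, cy):
      """Back-project a 2D keypoint (u, v) in pixels, observed at depth_m
      meters in an aligned depth map, to a 3D point in the camera frame."""
      x = (u - cx) * depth_m / fx
      y = (v - cy) * depth_m / fy
      return np.array([x, y, depth_m])

  # Assumed RGB-D intrinsics and an example keypoint at pixel (412, 287).
  fx, fy, cx, cy = 615.0, 615.0, 320.0, 240.0
  target_3d = keypoint_to_camera_frame(412, 287, 0.85, fx, fy, cx, cy)
  print(target_3d)  # approx. [0.127, 0.065, 0.850]

In practice, such a camera-frame point would still be transformed into the robot's base frame via the camera extrinsics before being issued as a movement command.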

VeBrain-600k

VeBrain-600k contains extensive data covering the basic capabilities of VeBrain:

  • (1) 200k for multimodal understanding. These samples integrate images, videos, and text, sourced from datasets such as ShareGPT4V and MMInstruct.
  • (2) 312k for spatial reasoning. Generated from ScanNet point cloud data, these samples cover counting, distance measurement, object-size estimation, and other forms of spatial understanding.
  • (3) 88k for robot control. These samples reformulate control for legged robots and robotic arms as keypoint detection and skill recognition (an illustrative sample is sketched below).
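
For concreteness, below is a hypothetical sample showing how multimodal CoT can mix spatial reasoning and a control decision within a single conversation. The field names, values, and schema are illustrative assumptions, not the actual VeBrain-600k format.

  # Hypothetical VeBrain-600k-style sample; all fields are illustrative.
  sample = {
      "image": "scene_0423.jpg",
      "conversations": [
          {"role": "user",
           "content": "Which apple is closer to the robot, and how should it pick it up?"},
          {"role": "assistant",
           "content": "<think>The left apple is about 0.6 m away and the right one "
                      "about 1.1 m away, so the left apple is closer.</think>"
                      '<answer>{"keypoints": [[412, 287]], "skill": "grasp"}</answer>'},
      ],
  }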

Performance

Ablation Study

Multimodal Benchmarks

3D Spatial Benchmarks

Real-world Evaluations

Citation


  @article{luo2025visual,
    title={Visual Embodied Brain: Let Multimodal Large Language Models See, Think, and Control in Spaces},
    author={Luo, Gen and Yang, Ganlin and Gong, Ziyang and Chen, Guanzhou and Duan, Haonan and Cui, Erfei and Tong, Ronglei and Hou, Zhi and Zhang, Tianyi and Chen, Zhe and others},
    journal={arXiv preprint arXiv:2506.00123},
    year={2025}
  }

  

Acknowledgement

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.