RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis

Yao Mu, Junting Chen, Qinglong Zhang, Shoufa Chen, Qiaojun Yu, Chongjian Ge, Runjian Chen, Zhixuan Liang, Mengkang Hu, Chaofan Tao, Peize Sun, Haibao Yu, Chao Yang, Wenqi Shao, Wenhai Wang, Jifeng Dai, Yu Qiao, Mingyu Ding, Ping Luo

TL;DR

RoboCodeX addresses the challenge of translating multimodal perception and language into robot-specific actions with a tree-structured multimodal code-generation framework. It decomposes a global instruction into sub-tasks and generates executable trajectories $\tau_i$ with $g(\cdot)$ from grounded predictions of object affordances, grasp poses, and articulation properties, subject to dynamics, collision, and control constraints. It introduces a specialized multimodal reasoning dataset and an iterative self-updating SFT pipeline to align semantic understanding with physical constraints, while leveraging TSDF-based perception and state-of-the-art grasp and articulation priors. Empirically, RoboCodeX achieves state-of-the-art performance across manipulation, navigation, and general multimodal reasoning, both in simulation and on real robots, and demonstrates cross-platform zero-shot transfer by reconfiguring robot parameters rather than retraining. This work suggests a practical path to unifying cognitive vision-language models with precise robotic control through executable code generation, enabling scalable, cross-platform embodied AI.

Abstract

Robotic behavior synthesis, the problem of understanding multimodal inputs and generating precise physical control for robots, is an important part of Embodied AI. Despite successes in applying multimodal large language models for high-level understanding, it remains challenging to translate these conceptual understandings into detailed robotic actions while achieving generalization across various scenarios. In this paper, we propose a tree-structured multimodal code generation framework for generalized robotic behavior synthesis, termed RoboCodeX. RoboCodeX decomposes high-level human instructions into multiple object-centric manipulation units consisting of physical preferences such as affordance and safety constraints, and applies code generation to introduce generalization ability across various robotics platforms. To further enhance the capability to map conceptual and perceptual understanding into control commands, a specialized multimodal reasoning dataset is collected for pre-training and an iterative self-updating methodology is introduced for supervised fine-tuning. Extensive experiments demonstrate that RoboCodeX achieves state-of-the-art performance in both simulators and real robots on four different kinds of manipulation tasks and one navigation task.

Paper Structure (22 sections, 7 equations, 9 figures, 5 tables)

Figures (9)

  • Figure 2: Example of robotic behavior synthesis with RoboCodeX. To accomplish the task "Place the banana into the drawer," we first decompose the whole task into a Drawer-centric Unit and a Banana-centric Unit. In the Drawer-centric Unit, the robot is programmed to understand that it must align its gripper with the prismatic joint axis of the drawer, which is the optimal position for movement, considering the drawer's physical limits and trajectory optimization. Conversely, the Banana-centric Unit requires the robot to align its gripper with the table surface normal and close to its center to pick up a banana. The accompanying code generation segment translates these multimodal considerations into executable instructions. For the drawer, the code includes determining the handle's position, executing the grip and pull actions in line with the drawer’s joint axis, and then releasing the handle. For the banana, the code sequences involve aligning the gripper, grasping the banana, moving it to the drawer, and detaching it at the destination.
  • Figure 3: Performance comparison on the pick-and-place task with diverse objects.
  • Figure 4: Generalization across different types of robots in the real world without any fine-tuning. We evaluate RoboCodeX with a Franka Emika Panda robot and a UR5 robot in the real world.
  • Figure 5: Ablation on the utilization of preference, vision adapter, and whether to use general VQA data during fine-tuning. We report the average success rate over 4 kinds of tasks over 50 trials.
  • Figure 6: Failure modes comparison between RoboCodeX and GPT-4V in long-term tasks.
  • ...and 4 more figures
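
The Figure 2 caption describes a two-unit decomposition of "Place the banana into the drawer" and the action sequence each unit produces. A minimal Python sketch of that sequencing follows. It is an illustration only, not RoboCodeX's generated code or API: the `LoggingRobot` class, its methods, the object names, and the numeric values (joint axis, pull distance, joint limit) are all hypothetical stand-ins chosen for this example.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


class LoggingRobot:
    """Hypothetical stand-in robot that only records the commands it receives."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def grasp(self, obj: str) -> None:
        self.log.append(f"grasp {obj}")

    def move_along(self, axis: Vec3, distance: float) -> None:
        self.log.append(f"move_along {axis} {distance:.2f}")

    def move_to(self, place: str) -> None:
        self.log.append(f"move_to {place}")

    def release(self) -> None:
        self.log.append("release")


def drawer_unit(robot: LoggingRobot, joint_axis: Vec3,
                pull_distance: float, joint_limit: float) -> None:
    """Drawer-centric unit: grip the handle, pull along the prismatic
    joint axis, then release. The pull is clamped to the joint limit."""
    distance = min(pull_distance, joint_limit)
    robot.grasp("drawer_handle")
    robot.move_along(joint_axis, distance)
    robot.release()


def banana_unit(robot: LoggingRobot) -> None:
    """Banana-centric unit: grasp the banana, carry it to the drawer,
    and release it there."""
    robot.grasp("banana")
    robot.move_to("drawer")
    robot.release()


def place_banana_in_drawer(robot: LoggingRobot) -> List[str]:
    """Run the two units in order; the drawer must be open first."""
    # Assumed values: the axis and distances are made up for this sketch.
    drawer_unit(robot, joint_axis=(1.0, 0.0, 0.0),
                pull_distance=0.30, joint_limit=0.25)
    banana_unit(robot)
    return robot.log


if __name__ == "__main__":
    for command in place_banana_in_drawer(LoggingRobot()):
        print(command)
```

Keeping each object-centric unit as its own function mirrors the decomposition in the caption: the drawer's physical limit stays local to the drawer unit, and the top-level function only fixes the order of the units.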