Towards Proprioception-Aware Embodied Planning for Dual-Arm Humanoid Robots

Boyu Li, Siyuan He, Hang Xu, Haoqi Yuan, Xinrun Xu, Yu Zang, Liwei Hu, Junpeng Yue, Zhenxiong Jiang, Pengbo Hu, Börje F. Karlsson, Yehui Tang, Zongqing Lu

TL;DR

This work addresses two gaps in MLLM-driven robotics: the scarcity of planning platforms for dual-arm humanoid robots and the weak embodiment grounding of current models. It introduces DualTHOR, a continuous, physically grounded dual-arm simulator with a contingency mechanism, and Proprio-MLLM, a proprioception-aware MLLM enhanced with a motion-based position embedding and a cross-spatial encoder. The approach integrates proprioceptive information into a multimodal alignment framework, which supports longer-horizon planning and better spatial reasoning for bimanual tasks, and it achieves an average improvement of 19.75% in planning performance over strong baselines across a diverse set of tasks. Key contributions include a dual-arm task benchmark, a motion-tokenizer-based alignment pipeline, and a cross-modal fusion strategy that grounds planning in the robot’s embodiment. The results highlight the importance of embodiment grounding for robust dual-arm planning and provide a benchmark and methodology to advance embodied intelligence in household humanoid robotics.

Abstract

In recent years, Multimodal Large Language Models (MLLMs) have demonstrated the ability to serve as high-level planners, enabling robots to follow complex human instructions. However, their effectiveness, especially in long-horizon tasks involving dual-arm humanoid robots, remains limited. This limitation arises from two main challenges: (i) the absence of simulation platforms that systematically support task evaluation and data collection for humanoid robots, and (ii) the insufficient embodiment awareness of current MLLMs, which hinders reasoning about dual-arm selection logic and body positions during planning. To address these issues, we present DualTHOR, a new dual-arm humanoid simulator, with continuous transition and a contingency mechanism. Building on this platform, we propose Proprio-MLLM, a model that enhances embodiment awareness by incorporating proprioceptive information with motion-based position embedding and a cross-spatial encoder. Experiments show that, while existing MLLMs struggle in this environment, Proprio-MLLM achieves an average improvement of 19.75% in planning performance. Our work provides both an essential simulation platform and an effective model to advance embodied intelligence in humanoid robotics. The code is available at https://anonymous.4open.science/r/DualTHOR-5F3B.

Paper Structure

This paper contains 20 sections, 6 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: DualTHOR is a novel simulator specifically tailored for dual-arm humanoid robots, while still preserving the diversity and realism of scenarios in previous AI2-THOR series simulators. As current MLLMs have limited effectiveness in planning for dual-arm embodied tasks, we propose Proprio-MLLM to achieve proprioception-aware, embodiment-grounded planning.
  • Figure 2: Example scenes of different rooms in DualTHOR. The types and quantities of objects vary across rooms, and the humanoid robot is capable of interacting with all objects within each room.
  • Figure 3: Example of picking up a "pourable" cup of coffee. The possible results include success (80%), coffee spill (10%), and mug broken (10%). DualTHOR provides both visual observations and environmental feedback after the robot executes an action, enabling the evaluation of the effectiveness of the current plan and the acquisition of information necessary for MLLM re-planning.
  • Figure 4: The framework of Proprio-MLLM. By incorporating proprioceptive information, we propose a multimodal alignment large language model, Proprio-MLLM. We introduce a motion-based position embedding method and a cross-spatial encoder, increasing the model’s embodiment awareness and spatial reasoning in dual-arm tasks.
  • Figure 5: Distribution of objects and tasks.
  • ...and 1 more figure
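
The contingency mechanism described in the Figure 3 caption can be illustrated with a minimal sketch. This is not DualTHOR's API: the names `PICKUP_POURABLE_OUTCOMES`, `sample_outcome`, and `execute_with_feedback` are hypothetical, and only the three outcome probabilities (80% / 10% / 10%) come from the caption. The sketch assumes an action's result is drawn from a fixed categorical distribution and that any non-success outcome is reported back so a planner can re-plan.

```python
import random
from collections import Counter

# Outcome probabilities from the Figure 3 caption (picking up a "pourable"
# cup of coffee). The dictionary name and keys are hypothetical.
PICKUP_POURABLE_OUTCOMES = {
    "success": 0.80,
    "coffee_spill": 0.10,
    "mug_broken": 0.10,
}


def sample_outcome(outcomes, rng):
    """Draw one outcome name from a categorical distribution."""
    names = list(outcomes)
    weights = [outcomes[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def execute_with_feedback(action, outcomes, rng):
    """Simulate one action and return feedback a planner could consume.

    Assumption: feedback is a plain dict; a real simulator would also
    return visual observations alongside it.
    """
    outcome = sample_outcome(outcomes, rng)
    return {
        "action": action,
        "outcome": outcome,
        # Any outcome other than success signals that re-planning is needed.
        "needs_replan": outcome != "success",
    }


# Usage: estimate empirical outcome frequencies over many simulated attempts.
rng = random.Random(0)
counts = Counter(
    sample_outcome(PICKUP_POURABLE_OUTCOMES, rng) for _ in range(10_000)
)
frequencies = {name: counts[name] / 10_000 for name in PICKUP_POURABLE_OUTCOMES}
```

With a seeded generator the empirical frequencies should land close to the stated probabilities, which makes this kind of stub convenient for checking a re-planning loop before connecting it to a simulator.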