BiManiBench: A Hierarchical Benchmark for Evaluating Bimanual Coordination of Multimodal Large Language Models

Xin Wu, Zhixuan Liang, Yue Ma, Mengkang Hu, Zhiyuan Qin, Xiu Li

TL;DR

BiManiBench addresses the lack of dual-arm benchmarks for multimodal large language models (MLLMs) by proposing a hierarchical evaluation across three tiers of bimanual manipulation. The framework employs a vision-driven agent with action chunking and a Task-Adaptive Execution Truncation mechanism to balance open-loop efficiency with closed-loop safety. Experiments across more than 30 models reveal that while high-level planning is strong in many MLLMs, dual-arm spatial grounding and precise end-effector control remain major bottlenecks, with inter-arm interference and sequencing errors common. The work highlights the need to integrate inter-arm kinematic constraints and collision avoidance into model architectures and planning heads, shaping future directions for robust bimanual embodied AI.
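
The interplay between action chunking and execution truncation can be illustrated with a minimal sketch. This is not the paper's implementation: the summary gives only the mechanism's name and purpose, so the per-task limits, the task-type labels, and the function names (`truncate_chunk`, `run_episode`) are all hypothetical. The assumption is that the agent predicts a chunk of actions, executes only a task-dependent prefix of it, and then re-observes the scene before planning again.

```python
from typing import Callable, Dict, List, Sequence

Action = Sequence[float]  # hypothetical: one dual-arm end-effector command

# Hypothetical per-task limits on how many actions of a predicted chunk are
# executed before re-observing. The labels and values are illustrative only.
TRUNCATION_LIMITS: Dict[str, int] = {
    "coarse_transport": 8,   # long open-loop runs for efficiency
    "precise_handover": 2,   # frequent re-observation for safety
}


def truncate_chunk(chunk: List[Action], task_type: str,
                   default_limit: int = 4) -> List[Action]:
    """Keep only the leading actions of a chunk, based on the task type."""
    limit = TRUNCATION_LIMITS.get(task_type, default_limit)
    return chunk[:limit]


def run_episode(plan_chunk: Callable[[], List[Action]],
                execute: Callable[[Action], None],
                is_done: Callable[[], bool],
                task_type: str,
                max_replans: int = 10) -> int:
    """Alternate between planning a chunk and executing its truncated prefix.

    Returns the number of actions executed.
    """
    executed = 0
    for _ in range(max_replans):
        chunk = plan_chunk()  # stands in for the model call that predicts a chunk
        for action in truncate_chunk(chunk, task_type):
            execute(action)   # open-loop within the truncated prefix
            executed += 1
            if is_done():
                return executed
        # Falling out of the inner loop closes the loop: the next iteration
        # replans from a fresh observation.
    return executed
```

Under these assumptions, a smaller limit means more frequent replanning (closer to closed-loop control), while a larger limit amortizes each model call over more executed actions.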

Abstract

Multimodal Large Language Models (MLLMs) have significantly advanced embodied AI, and using them to benchmark robotic intelligence has become a pivotal trend. However, existing frameworks remain predominantly confined to single-arm manipulation, failing to capture the spatio-temporal coordination required for bimanual tasks like lifting a heavy pot. To address this, we introduce BiManiBench, a hierarchical benchmark evaluating MLLMs across three tiers: fundamental spatial reasoning, high-level action planning, and low-level end-effector control. Our framework isolates unique bimanual challenges, such as arm reachability and kinematic constraints, thereby distinguishing perceptual hallucinations from planning failures. Analysis of over 30 state-of-the-art models reveals that despite high-level reasoning proficiency, MLLMs struggle with dual-arm spatial grounding and control, frequently resulting in mutual interference and sequencing errors. These findings suggest the current paradigm lacks a deep understanding of mutual kinematic constraints, highlighting the need for future research to focus on inter-arm collision-avoidance and fine-grained temporal sequencing.

Paper Structure

This paper contains 41 sections, 1 equation, 12 figures, 7 tables.

Figures (12)

  • Figure 1: Overview of BiManiBench. A hierarchical framework evaluating MLLMs across three tiers: spatial reasoning, high-level action planning, and low-level continuous control.
  • Figure 1: High-quality reasoning example.
  • Figure 2: The vision-driven agent framework for BiManiBench. This architecture facilitates a structured cycle of multimodal perception, iterative reasoning, and tiered action formulation for bimanual manipulation. See the agent design section for further implementation details.
  • Figure 2: Average-quality reasoning example with spatial ambiguity.
  • Figure 3: Comparison of error type distributions. Analysis of failure modes for (a) GPT-5 and (b) Gemini-2.5-Pro. Inner rings represent primary error categories (Perceptual vs. Planning), while outer rings detail specific failure modes. Detailed definitions are provided in the error analysis appendix.
  • ...and 7 more figures