BiGym: A Demo-Driven Mobile Bi-Manual Manipulation Benchmark

Nikita Chernyadev, Nicholas Backshall, Xiao Ma, Yunfan Lu, Younggyo Seo, Stephen James

TL;DR

BiGym addresses the lack of realistic benchmarks for demo-driven mobile bi-manual manipulation by offering 40 tasks with human demonstrations in a humanoid embodiment. Built on a MuJoCo-based Unitree H1 platform, it provides multi-modal observations and flexible action modes to support imitation learning and demo-driven RL under sparse rewards. The paper evaluates a range of IL and RL methods, finding that generative policies (ACT, Diffusion Policy) perform best on BiGym's noisy, multi-modal data, though many long-horizon tasks remain difficult. The benchmark, datasets, and tools are poised to drive advances in memory, belief estimation, and hierarchical planning for humanoid mobile manipulation.

Abstract

We introduce BiGym, a new benchmark and learning environment for mobile bi-manual demo-driven robotic manipulation. BiGym features 40 diverse tasks set in home environments, ranging from simple target reaching to complex kitchen cleaning. To capture real-world performance accurately, we provide human-collected demonstrations for each task, reflecting the diverse modalities found in real-world robot trajectories. BiGym supports a variety of observations, including proprioceptive data and visual inputs such as RGB and depth from 3 camera views. To validate the usability of BiGym, we thoroughly benchmark state-of-the-art imitation learning and demo-driven reinforcement learning algorithms within the environment and discuss future opportunities.

Paper Structure (14 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: BiGym focuses on mobile manipulation with home assistance humanoids. We provide 40 tasks ranging from simple mobile target reaching to complex dishwasher manipulations. Specifically, each task comes with demonstrations recorded by human demonstrators and can be used to benchmark both imitation learning and reinforcement learning algorithms.
  • Figure 2: (a) BiGym builds upon the Unitree H1 robot with 3 RGB-D cameras at the head, left wrist, and right wrist. We collect human demonstrations by tele-operating with VR devices. BiGym allows users to control the humanoid in either whole-body mode, which considers both locomotion and manipulation, or bi-manual mode, which simplifies locomotion with a predefined controller for the lower body. (b) BiGym provides human-collected multi-modal demonstrations for tasks, e.g., in reach_target_multi_modal, the agent can finish the task by reaching the target with either the left or right hand.
  • Figure 3: Visualisations of wrist position distributions in BiGym and RLBench. We visualise the wrist positions of BiGym human-collected trajectories on the reach_target_multi_modal and wall_cupboard_open tasks, as well as of RLBench trajectories on the reach_target and put_knife_on_chopping_board tasks. The BiGym trajectories are noisy and multi-modal but generally smooth, whereas the motion-planner-generated trajectories of RLBench are either straight lines or unnatural.
  • Figure 4: Example usage of the BiGym Environment for training a reinforcement learning agent. Demonstrations are pulled from a remote store and cached locally. Users can also customise their action modes or use the off-the-shelf JointPositionActionMode with flags to switch between the bi-manual or whole-body action modes with either absolute or delta actions.
  • Figure 5: The environment run speed of BiGym with (a) different numbers of cameras and (b) different action modes. In (a), we use the bi-manual control method to measure performance.
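
The Figure 4 caption mentions switching between absolute and delta joint-position actions. The difference can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `JointActionSketch`, its `target` method, and the joint limits are hypothetical names and values chosen for the example, not the BiGym `JointPositionActionMode` API.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class JointActionSketch:
    """Hypothetical stand-in for a joint-position action mode (not the BiGym class)."""

    lower: np.ndarray      # assumed lower joint limits
    upper: np.ndarray      # assumed upper joint limits
    absolute: bool = True  # True: action is a target pose; False: action is an offset

    def target(self, current, action):
        """Return the joint targets a position controller would track."""
        current = np.asarray(current, dtype=float)
        action = np.asarray(action, dtype=float)
        # Absolute mode uses the action directly; delta mode adds it to the current pose.
        raw = action if self.absolute else current + action
        # Keep the commanded targets inside the joint limits.
        return np.clip(raw, self.lower, self.upper)


limits_lo = np.array([-1.0, -1.0])
limits_hi = np.array([1.0, 1.0])
q = np.array([0.2, -0.4])  # current joint positions (made-up values)

abs_mode = JointActionSketch(limits_lo, limits_hi, absolute=True)
delta_mode = JointActionSketch(limits_lo, limits_hi, absolute=False)

abs_target = abs_mode.target(q, [0.5, 0.0])       # interpreted as the pose itself
delta_target = delta_mode.target(q, [0.5, -0.9])  # interpreted as a change from q
```

In this sketch the same policy output means different things depending on the flag: in absolute mode it is the desired joint configuration, while in delta mode it is added to the current configuration before clipping, so the second joint above saturates at its lower limit.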