Active Stereo-Camera Outperforms Multi-Sensor Setup in ACT Imitation Learning for Humanoid Manipulation

Robin Kühn, Moritz Schappler, Thomas Seel, Dennis Bank

Abstract

The complexity of teaching humanoid robots new tasks is one of the major obstacles to their widespread adoption in industry. While Imitation Learning (IL), particularly Action Chunking with Transformers (ACT), enables rapid task acquisition, there is no consensus yet on the optimal sensory hardware required for manipulation tasks. This paper benchmarks 14 sensor combinations on the Unitree G1 humanoid robot equipped with three-finger hands for two manipulation tasks. We explicitly evaluate the integration of tactile and proprioceptive modalities alongside active vision. Our analysis demonstrates that strategic sensor selection can outperform complex configurations in data-limited regimes while reducing computational overhead. We develop an open-source Unified Ablation Framework that utilizes sensor masking on a comprehensive master dataset. Results indicate that additional modalities often degrade performance for IL with limited data. A minimal active stereo-camera setup outperformed complex multi-sensor configurations, achieving 87.5% success in a spatial generalization task and 94.4% in a structured manipulation task. Conversely, adding pressure sensors to this setup reduced success to 67.3% in the latter task due to a low signal-to-noise ratio. We conclude that in data-limited regimes, active vision offers a superior trade-off between robustness and complexity. While tactile modalities may require larger datasets to be effective, our findings validate that strategic sensor selection is critical for designing an efficient learning process.

Paper Structure

This paper contains 20 sections, 1 equation, 7 figures, 1 table.

Figures (7)

  • Figure 2: Custom ACT Architecture: Schematic representation integrating pressure sensors, joint velocities, and torques into the state observation vector. Building upon: zhao2023learningfinegrainedbimanualmanipulation, cadene2024lerobot, cheng2024tv.
  • Figure 3: Teleoperation Setup: The operator controls the Unitree G1 (with three-finger hands) via VR. Head movements are synchronized, enabling the collection of active vision data.
  • Figure 4: Unified Ablation Framework: To prevent human variance from influencing results, we record a master dataset with all sensors. During training, a masking module creates specific configurations (e.g., $A$, $WA-P$) from identical demonstration data.
  • Figure 5: Execution time vs. success rate. Task 1: The maximalist setup ($WA-P$; marked with ① in the plot) achieves the best combination of execution time and success. However, a single active stereo-camera ($A$ ②) offers nearly identical results with a fraction of the state-vector size and thereby computational complexity. Notably, simply adding pressure sensors ($A-P$ ③) causes a performance drop compared to the camera-only baseline. Task 2: In generalization tasks, the minimal active setup ($A$ ④) dominates, combining the highest success rate (mean 87.5%) with the fastest execution time (mean 0.4 min). Conversely, setups involving static cameras (e.g., $S$ ⑤, $S_{\text{L}}WA\text{-}P$ ⑥) yield suboptimal performance and increased execution times.
  • Figure 6: Qualitative results of the Sort Cans task ($A$ policy). The picture links to a video of the autonomous task execution.
  • ...and 2 more figures
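The Unified Ablation Framework described in the abstract and Figure 4 trains every sensor configuration from one master dataset by masking out unused modalities at training time, so that human variance across recordings cannot confound the comparison. A minimal sketch of this masking idea is shown below; all modality names, dimensionalities, and the zeroing strategy are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Assumed modality layout of a master-dataset state vector (hypothetical
# names and dimensions; the paper's real layout may differ).
MASTER_MODALITIES = {
    "joint_positions": 29,
    "joint_velocities": 29,
    "joint_torques": 29,
    "pressure": 18,  # fingertip pressure sensors
}

def mask_observation(obs: dict, active: set) -> dict:
    """Return a copy of obs in which inactive modalities are zeroed.

    Zeroing (rather than dropping) keeps the state-vector layout
    identical across configurations, so the same policy architecture
    can be trained on every ablation.
    """
    return {
        name: (arr if name in active else np.zeros_like(arr))
        for name, arr in obs.items()
    }

# Example: derive a proprioception-only configuration from one
# master-dataset sample.
sample = {n: np.random.randn(d) for n, d in MASTER_MODALITIES.items()}
masked = mask_observation(sample, active={"joint_positions"})
```

Under this scheme, a configuration such as $A$ (camera only) would zero all extra state modalities, while $WA-P$ would keep them all active; the demonstration data itself never changes.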