Vision in Action: Learning Active Perception from Human Demonstrations
Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, Shuran Song
TL;DR
This work addresses how robots can actively perceive during manipulation under visual occlusion by learning from human demonstrations. It introduces ViA, which couples a simple 6-DoF robotic neck, a VR teleoperation interface built on an intermediate 3D scene representation, and a diffusion-policy visuomotor learner using a pretrained DINOv2 encoder. Experiments on three multi-stage tasks with occlusions show that ViA improves final success rates by about $45\%$ over baselines and highlight the importance of active perception and human-robot observation alignment. The work offers a scalable data-collection paradigm, identifies practical limitations related to depth fidelity, and points to possible extensions in memory and language-conditioned control.
Abstract
We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
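The decoupling described above, where the operator's view is rendered in real time from an intermediate 3D scene while the robot's observations update that scene asynchronously, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the scene is stored as a world-frame point cloud, and the names `SceneBuffer` and `render_view`, the pinhole intrinsics, and the depth-only output are all hypothetical choices made for illustration.

```python
import threading

import numpy as np


class SceneBuffer:
    """Thread-safe holder for the latest scene (assumed: world-frame point cloud)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._points = np.empty((0, 3))

    def update(self, points_world):
        # Called by the (slow) robot-side thread whenever a new observation arrives.
        with self._lock:
            self._points = np.asarray(points_world, dtype=float).copy()

    def snapshot(self):
        # Called by the (fast) render loop; never waits on robot motion.
        with self._lock:
            return self._points.copy()


def render_view(points_world, T_world_from_head, fx, fy, cx, cy, width, height):
    """Project the scene into a pinhole camera at the operator's current head pose.

    Returns a depth image; pixels with no point are set to inf.
    """
    T_head_from_world = np.linalg.inv(T_world_from_head)
    homo = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts = (T_head_from_world @ homo.T).T[:, :3]
    pts = pts[pts[:, 2] > 1e-6]  # keep only points in front of the camera
    u = np.round(fx * pts[:, 0] / pts[:, 2] + cx).astype(int)
    v = np.round(fy * pts[:, 1] / pts[:, 2] + cy).astype(int)
    ok = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    depth = np.full((height, width), np.inf)
    # Simple z-buffer: keep the nearest point that lands on each pixel.
    np.minimum.at(depth, (v[ok], u[ok]), pts[ok, 2])
    return depth
```

In this sketch, a robot-side thread would call `update` at whatever rate the neck and camera deliver observations, while the headset loop calls `snapshot` and `render_view` with the operator's newest head pose every frame. The rendered view therefore follows the operator's head immediately, even when the physical neck lags behind.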
