Vision in Action: Learning Active Perception from Human Demonstrations

Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, Shuran Song

TL;DR

This work tackles the challenge of enabling robots to actively perceive during manipulation under occlusions by learning from human demonstrations. It introduces ViA, which couples a simple 6-DoF robotic neck, a VR teleoperation interface with an intermediate 3D scene, and a diffusion-policy visuomotor learner using a pretrained DINOv2 encoder. Three multi-stage tasks with occlusions demonstrate that ViA significantly improves final success rates (about 45% over baselines) and highlights the importance of active perception and human-robot observation alignment. The work offers a scalable data-collection paradigm and reveals practical limitations related to depth fidelity and potential extensions in memory and language-conditioned control.

Abstract

We present Vision in Action (ViA), an active perception system for bimanual robot manipulation. ViA learns task-relevant active perceptual strategies (e.g., searching, tracking, and focusing) directly from human demonstrations. On the hardware side, ViA employs a simple yet effective 6-DoF robotic neck to enable flexible, human-like head movements. To capture human active perception strategies, we design a VR-based teleoperation interface that creates a shared observation space between the robot and the human operator. To mitigate VR motion sickness caused by latency in the robot's physical movements, the interface uses an intermediate 3D scene representation, enabling real-time view rendering on the operator side while asynchronously updating the scene with the robot's latest observations. Together, these design elements enable the learning of robust visuomotor policies for three complex, multi-stage bimanual manipulation tasks involving visual occlusions, significantly outperforming baseline systems.
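
The idea of rendering the operator's view from an intermediate 3D scene, rather than streaming raw camera images, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`backproject`, `render_depth`), the pinhole intrinsics `K`, and the 4x4 pose matrices are assumptions made for illustration, and only depth (not colour) is rendered to keep the example short. The sketch lifts an RGB-D depth image to a world-frame point cloud, then re-projects that cloud into a virtual camera placed at the operator's latest head pose.

```python
import numpy as np

def backproject(depth, K, T_world_cam):
    """Lift a depth image (H, W), in metres, to world-frame points (N, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.reshape(-1)
    valid = z > 0  # drop pixels with no depth reading
    u, v, z = u.reshape(-1)[valid], v.reshape(-1)[valid], z[valid]
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    return (T_world_cam @ pts_cam.T).T[:, :3]

def render_depth(points_world, K, T_world_view, shape):
    """Project world points into a virtual view; keep the nearest depth per pixel."""
    h, w = shape
    T_view_world = np.linalg.inv(T_world_view)
    homog = np.c_[points_world, np.ones(len(points_world))]
    pts = (T_view_world @ homog.T).T[:, :3]
    pts = pts[pts[:, 2] > 1e-6]  # keep points in front of the virtual camera
    u = np.round(K[0, 0] * pts[:, 0] / pts[:, 2] + K[0, 2]).astype(int)
    v = np.round(K[1, 1] * pts[:, 1] / pts[:, 2] + K[1, 2]).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    out = np.full((h, w), np.inf)  # inf marks pixels no point landed on
    np.minimum.at(out, (v[inside], u[inside]), pts[inside, 2])  # z-buffer
    return out

# Hypothetical toy data: a flat wall 2 m in front of the robot's head camera.
K = np.array([[2.0, 0.0, 1.5], [0.0, 2.0, 1.5], [0.0, 0.0, 1.0]])
robot_cam_pose = np.eye(4)
cloud = backproject(np.full((4, 4), 2.0), K, robot_cam_pose)

# The operator leans 1 m forward; the view is re-rendered from the stored cloud
# without waiting for the physical robot head to move.
operator_pose = np.eye(4)
operator_pose[2, 3] = 1.0
same_view = render_depth(cloud, K, robot_cam_pose, (4, 4))
moved_view = render_depth(cloud, K, operator_pose, (4, 4))
```

In a design like the one the abstract describes, the render call would run at the headset's frame rate using the operator's head pose, while the point cloud would be refreshed asynchronously whenever a new robot observation arrives. The sketch keeps those two steps as separate functions for that reason.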

Paper Structure

This paper contains 12 sections, 7 figures.

Figures (7)

  • Figure 1: Vision in Action (ViA) uses an active head camera to search for the target object (yellow banana) inside the bag. The wrist cameras are ineffective in this visually occluded scenario, as they are constrained by the arm motions.
  • Figure 2: VR Teleoperation Comparison. [Left] Traditional RGB streaming suffers from motion-to-photon latency due to both RGB data transmission latency and robot control latency, often leading to VR motion sickness. [Right] Our system mitigates this by: (a, e) streaming a 3D point cloud in the world frame from RGB-D data, (b, c) performing real-time view rendering based on the user's latest head pose, and (d) asynchronously updating the robot's head and arm poses. This approach enables low-latency viewpoint updates for the user.
  • Figure 3: Task Definitions. We introduce three multi-stage tasks that highlight the critical role of active perception in everyday scenarios. [Left] Third-person view with red arrows indicating head movements and blue arrows indicating arm movements. [Middle] Active head camera views across task stages (upper row), and third-person view of robot actions (lower row). [Right] Test scenarios, including training and testing objects for the bag task, and different test configurations for the latter two tasks.
  • Figure 4: Policy Learning Camera Setup Comparison. [ViA] uses a single active head camera that dynamically adjusts its viewpoint to capture task-relevant visual information (e.g., finding a cup hidden inside a shelf). In contrast, [Wrist & Chest cameras] policy often fails due to visual occlusions. For example, in the cup task, the right wrist camera's view is blocked by the upper shelf tier, resulting in insufficient visual cues for grasping. The chest camera also fails to capture task-relevant information due to its fixed viewpoint, even when equipped with a fisheye lens.
  • Figure 5: Policy Learning Camera Setup Comparison Results. We report stage-wise success rates across the three tasks to demonstrate the effectiveness of our active head camera [ViA] compared to two baseline configurations: [Active Head & Wrist Cameras] and [Chest & Wrist Cameras].
  • ...and 2 more figures