Observer Actor: Active Vision Imitation Learning with Sparse View Gaussian Splatting

Yilong Wang, Cheng Qian, Ruomeng Fan, Edward Johns

TL;DR

ObAct addresses occlusion and limited field of view in robotic manipulation by decoupling perception (observer) from action (actor) and optimizing the test-time viewpoint using sparse-view 3D Gaussian Splatting. The method extends trajectory transfer and behavior cloning to view-conditioned settings, enabling ambidextrous inference and improved data efficiency. Experiments on a real dual-arm system show substantial performance gains over static-camera baselines in both occluded and non-occluded scenarios, illustrating the practical impact of active vision with fast scene representations. The work highlights a scalable approach to robust manipulation under viewpoint variability and occlusions, with clear avenues for faster pipelines and richer multi-arm configurations.

Abstract

We propose Observer Actor (ObAct), a novel framework for active vision imitation learning in which the observer moves to optimal visual observations for the actor. We study ObAct on a dual-arm robotic system equipped with wrist-mounted cameras. At test time, ObAct dynamically assigns observer and actor roles: the observer arm constructs a 3D Gaussian Splatting (3DGS) representation from three images, virtually explores this to find an optimal camera pose, then moves to this pose; the actor arm then executes a policy using the observer's observations. This formulation enhances the clarity and visibility of both the object and the gripper in the policy's observations. As a result, we enable the training of ambidextrous policies on observations that remain closer to the occlusion-free training distribution, leading to more robust policies. We study this formulation with two existing imitation learning methods -- trajectory transfer and behavior cloning -- and experiments show that ObAct significantly outperforms static-camera setups: trajectory transfer improves by 145% without occlusion and 233% with occlusion, while behavior cloning improves by 75% and 143%, respectively. Videos are available at https://obact.github.io.
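The test-time step described above (virtually exploring a scene representation to pick a camera pose) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the hemisphere sampling, and the scoring rule are all assumptions made for illustration. In particular, a simple line-of-sight test against spherical occluders stands in for rendering candidate views from the 3DGS representation and measuring how visible the object and gripper are.

```python
import numpy as np

def sample_hemisphere_views(center, radius, n_azimuth=12, n_elevation=3):
    """Hypothetical candidate camera positions on a hemisphere above `center`."""
    views = []
    for elev in np.linspace(np.deg2rad(20), np.deg2rad(70), n_elevation):
        for azim in np.linspace(0.0, 2 * np.pi, n_azimuth, endpoint=False):
            offset = radius * np.array([
                np.cos(elev) * np.cos(azim),
                np.cos(elev) * np.sin(azim),
                np.sin(elev),
            ])
            views.append(center + offset)
    return np.array(views)

def segment_hits_sphere(p0, p1, sphere_center, sphere_radius):
    """True if the segment p0 -> p1 passes through the sphere."""
    d = p1 - p0
    # Parameter of the point on the segment closest to the sphere centre.
    t = np.clip(np.dot(sphere_center - p0, d) / np.dot(d, d), 0.0, 1.0)
    closest = p0 + t * d
    return np.linalg.norm(closest - sphere_center) < sphere_radius

def visibility_score(camera, targets, occluders):
    """Fraction of target points with an unobstructed line of sight.

    Assumed proxy for a rendered-view visibility measure; the real
    objective is not specified in this section.
    """
    visible = 0
    for target in targets:
        blocked = any(segment_hits_sphere(camera, target, c, r)
                      for c, r in occluders)
        visible += not blocked
    return visible / len(targets)

def select_view(candidates, targets, occluders):
    """Return the candidate position with the highest visibility score."""
    scores = np.array([visibility_score(c, targets, occluders)
                       for c in candidates])
    best = int(np.argmax(scores))
    return candidates[best], scores[best]
```

In this sketch, `targets` would hold points on the task-relevant object part and the gripper, and `occluders` is a list of `(center, radius)` pairs. A real system would also need to check that the chosen pose is reachable by the observer arm, which is omitted here.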

Paper Structure

This paper contains 12 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Active vision for imitation learning in a mug-handle pickup task across five scenarios. When a static camera struggles (top row), alternative placements (bottom row, and coloured frustums in top row) provide better observations. In our method, at test time an observer robot (robot on right in above examples) computes and moves to such an optimal view from its wrist-cam, after which an actor robot (robot on left in above examples) performs the task conditioned on this view.
  • Figure 2: Framework Overview. (1) Train: The operator selects a demonstration optimal view, moves the observer arm to this view, and records a demonstration. This process is repeated as required by the imitation learning method. (2) Test: The robots explore six views of the scene to construct a 3DGS representation. View optimization within this representation identifies the test-time optimal view. The observer arm then moves to this view, after which the actor arm executes the task.
  • Figure 3: Images of Optimal Views. Top row: demonstration optimal views. Middle row: test-time optimal views in 3DGS with gripper mask overlay. Bottom row: real-world test-time optimal views. Red boxes indicate the task-relevant object parts. Test-time optimal views are derived by reconstructing the demonstration’s optimal viewpoints subject to minimal occlusion.
  • Figure 4: Data Efficiency of Behavior Cloning with Active Vision. With the same number of demonstrations, our method outperforms the static camera setup across three tasks, evaluated using 30, 50, and 70 demonstrations.
  • Figure 5: Effect of Number of Exploration Views.