Observe Then Act: Asynchronous Active Vision-Action Model for Robotic Manipulation

Guokang Wang, Hang Li, Shuyuan Zhang, Di Guo, Yanhong Liu, Huaping Liu

TL;DR

This work tackles robotic manipulation under occlusion by introducing a task-driven asynchronous active vision-action model that decouples sensing and acting through NBV and NBP policies. The approach uses viewpoint-centric voxel alignment, viewpoint-aware demonstration augmentation, and a pair of task-agnostic auxiliary rewards to learn sensor-motor coordination from few demonstrations. Across eight RLBench tasks, the method outperforms passive and fixed-view baselines, particularly in occluded conditions, highlighting the practical value of active viewpoint control for manipulation tasks. The contributions advance how robots acquire and use visual information to guide actions, enabling more reliable manipulation in real-world, vision-constrained settings.

Abstract

In real-world scenarios, many robotic manipulation tasks are hindered by occlusions and limited fields of view, posing significant challenges for passive observation-based models that rely on fixed or wrist-mounted cameras. In this paper, we investigate the problem of robotic manipulation under limited visual observation and propose a task-driven asynchronous active vision-action model. Our model serially connects a camera Next-Best-View (NBV) policy with a gripper Next-Best-Pose (NBP) policy, and trains them in a sensor-motor coordination framework using few-shot reinforcement learning. This approach allows the agent to adjust a third-person camera to actively observe the environment based on the task goal, and subsequently infer the appropriate manipulation actions. We trained and evaluated our model on 8 viewpoint-constrained tasks in RLBench. The results demonstrate that our model consistently outperforms baseline algorithms, showcasing its effectiveness in handling visual constraints in manipulation tasks.
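
The serial NBV-then-NBP structure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `ToyEnv`, `Transition`, `observe_then_act`, and the two policy callables are hypothetical names introduced here, and the toy reward is an assumption. The sketch only shows the ordering (observe, move the camera, observe again, act) and that both decisions are credited with the same task reward obtained after the gripper acts.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class Transition:
    policy: str              # "nbv" or "nbp"
    observation: np.ndarray
    action: np.ndarray
    reward: float


class ToyEnv:
    """Hypothetical stand-in environment; this is NOT the RLBench API."""

    def __init__(self) -> None:
        self.viewpoint = np.zeros(3)

    def observe(self) -> np.ndarray:
        # In this toy, the observation is simply the current camera viewpoint.
        return self.viewpoint.copy()

    def move_camera(self, viewpoint: np.ndarray) -> None:
        self.viewpoint = np.asarray(viewpoint, dtype=float)

    def step_gripper(self, pose: np.ndarray) -> float:
        # Assumed toy task reward: 1.0 if the pose is near a fixed target.
        return float(np.linalg.norm(np.asarray(pose) - np.ones(3)) < 0.5)


def observe_then_act(
    env: ToyEnv,
    nbv_policy: Callable[[np.ndarray], np.ndarray],
    nbp_policy: Callable[[np.ndarray], np.ndarray],
) -> List[Transition]:
    obs_global = env.observe()          # observation from the current view
    viewpoint = nbv_policy(obs_global)  # NBV: choose where to look
    env.move_camera(viewpoint)
    obs_roi = env.observe()             # observation from the new view
    pose = nbp_policy(obs_roi)          # NBP: choose the gripper pose
    reward = env.step_gripper(pose)     # task reward after the gripper acts
    # Both decisions are stored with the same task reward.
    return [
        Transition("nbv", obs_global, viewpoint, reward),
        Transition("nbp", obs_roi, pose, reward),
    ]


# Usage with constant placeholder policies.
transitions = observe_then_act(
    ToyEnv(),
    nbv_policy=lambda obs: np.array([1.0, 0.0, 0.0]),
    nbp_policy=lambda obs: np.ones(3),
)
```

In a real system the two callables would be learned networks and the returned transitions would go to a replay buffer; those parts are left out here because this section does not describe them.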

Paper Structure (11 sections, 10 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Asynchronous active vision manipulation. In the open_drawer task from RLBench (James et al., 2020), the initial viewpoint may not effectively capture the drawer handle due to occlusion. Our model first predicts an optimal viewpoint for better observation of the handle and then determines the gripper action based on this updated view.
  • Figure 2: Decision process comparison. We divide a single-step interaction into NBV for viewpoint selection and NBP for action selection. The reward $r_{t'}$ is derived from $r_{t+1}$ after the NBP interaction, enabling joint sensor-motor training through shared task rewards.
  • Figure 3: Illustration of the model pipeline and action spaces. The NBV policy infers the 3D ROI position $\mathbf{r}^*$ from the global scene observation and predicts the optimal viewpoint $\mathbf{v}^*$ for observing the ROI according to the given task goal. The NBP agent determines the gripper actions based on the ROI observation from $\mathbf{v}^*$.
  • Figure 4: Viewpoint-centric voxel alignment. Because 3D-CNNs lack spatial rotation invariance, voxelizing the scene with reference to the world frame $\{\text{W}\}$ can cause similar observations from different viewpoints to appear distant in feature space, increasing observation sample variance.
  • Figure 5: Demo trajectory augmentation. The raw transition $\text{T}$ only contains observations and actions at key moments, while the additional augmentative transitions $\text{T}_\text{a}$ are constructed from intermediate observations and actions.
  • ...and 3 more figures
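
The idea behind the Figure 4 caption can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the paper's procedure: it assumes the viewpoint pose in the world frame is known as a rotation `R_wv` and translation `t_wv`, and the helper names `world_to_view` and `voxelize` are hypothetical. The same object seen from the same relative viewpoint is placed at two different world poses; the world-frame grids differ, while the viewpoint-frame grids coincide.

```python
import numpy as np


def rot_z(theta: float) -> np.ndarray:
    """Rotation matrix about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


def world_to_view(points_w: np.ndarray, R_wv: np.ndarray, t_wv: np.ndarray) -> np.ndarray:
    """Express world-frame points (N, 3) in the viewpoint frame.

    Assumes (R_wv, t_wv) is the viewpoint pose in the world frame, so that
    p_v = R_wv^T (p_w - t_wv); the line below is the row-vector form.
    """
    return (points_w - t_wv) @ R_wv


def voxelize(points: np.ndarray, lo: float, hi: float, n: int) -> np.ndarray:
    """Binary occupancy grid of shape (n, n, n) over the cube [lo, hi)^3."""
    idx = np.floor((points - lo) / (hi - lo) * n).astype(int)
    inside = np.all((idx >= 0) & (idx < n), axis=1)
    grid = np.zeros((n, n, n), dtype=bool)
    grid[tuple(idx[inside].T)] = True
    return grid


# A short bar of points along +x, seen from a viewpoint at the world origin.
points_a = np.array([[x, 0.125, 0.125] for x in (0.125, 0.375, 0.625, 0.875)])
R_a, t_a = np.eye(3), np.zeros(3)

# The same bar and viewpoint, rigidly moved together to a new world pose.
R_b, t_b = rot_z(np.pi / 2), np.array([0.25, -0.25, 0.0])
points_b = points_a @ R_b.T + t_b

# Voxelized in the world frame, the two observations look different.
world_a = voxelize(points_a, -1.0, 1.0, 8)
world_b = voxelize(points_b, -1.0, 1.0, 8)

# Voxelized in the viewpoint frame, they coincide.
view_a = voxelize(world_to_view(points_a, R_a, t_a), -1.0, 1.0, 8)
view_b = voxelize(world_to_view(points_b, R_b, t_b), -1.0, 1.0, 8)
```

Because a 3D-CNN treats a rotated grid as a different input, removing the dependence on world pose before voxelizing is one plausible way to reduce the observation variance the caption refers to; how the paper defines its viewpoint frame is not stated in this section.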