Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation

Yongjie Bai, Zhouxia Wang, Yang Liu, Kaijun Luo, Yifan Wen, Mingtong Dai, Weixing Chen, Ziliang Chen, Lingbo Liu, Guanbin Li, Liang Lin

TL;DR

This work tackles robustness and generalization in multi-task robotic manipulation by addressing two core bottlenecks: limited 3D perception from fixed viewpoints and task interference from shared visual encoders. It introduces TVVE, a framework that couples a Multi-Viewpoint Exploration Policy (MVEP) for dynamic, informative viewpoint rendering with a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to learn task-specific representations. The training protocol combines a fixed-view pretraining stage, offline PPO-based MVEP refinement via a pseudo-environment, and a final joint fine-tuning stage, enabling efficient learning without extensive real-world interaction. Empirical results on RLBench, the RLBench-OG benchmark, and real-robot experiments demonstrate substantial gains in accuracy and robustness, including under occlusions and visual perturbations, validating the effectiveness of task-aware viewpoint exploration and modular perception for generalization. The work also provides comprehensive ablations and analyses of routing behavior, generalization to unseen tasks, and practical considerations for real-world deployment.

Abstract

Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views. Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, boosting both representation fidelity and task generalization. By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, demonstrating significantly enhanced action prediction across a wide array of manipulation challenges. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that our TVVE achieves superior performance over state-of-the-art approaches. In real-robot experiments, TVVE demonstrates exceptional performance and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
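As a rough illustration of the task-aware routing idea behind TaskMoE, the sketch below maps several tasks onto a smaller set of shared gates (the Figure 3 caption describes $N_G$ gates shared across $N_J$ tasks with $N_G < N_J$) and lets each gate pick experts for an input feature vector. The function names, the linear gate scores, the top-k selection, and the single fused feature vector standing in for instruction and vision inputs are all assumptions made for illustration; the source does not specify them.

```python
import math
from typing import Dict, List, Sequence


def softmax(xs: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(
    task_id: int,
    features: Sequence[float],
    task_to_gate: Dict[int, int],
    gate_weights: List[List[List[float]]],
    top_k: int = 2,
) -> Dict[int, float]:
    """Return {expert_index: mixing_weight} for one input.

    task_to_gate  : hypothetical lookup from task ID to a shared gate index.
    gate_weights  : gate_weights[g][e] is the score vector of expert e under gate g.
    top_k         : assumed sparse routing; the source does not state the rule.
    """
    gate = gate_weights[task_to_gate[task_id]]
    # One linear score per expert (an assumed gate form).
    logits = [sum(w * x for w, x in zip(row, features)) for row in gate]
    # Keep the k highest-scoring experts and renormalise their weights.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    probs = softmax([logits[i] for i in top])
    return dict(zip(top, probs))


# Three tasks share two gates: tasks 0 and 1 use gate 0, task 2 uses gate 1.
task_to_gate = {0: 0, 1: 0, 2: 1}
gate_weights = [
    [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]],  # gate 0, three experts
    [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],  # gate 1, three experts
]
features = [2.0, 1.0]

routing_task0 = route(0, features, task_to_gate, gate_weights)
routing_task1 = route(1, features, task_to_gate, gate_weights)
routing_task2 = route(2, features, task_to_gate, gate_weights)
```

In this toy setup, tasks 0 and 1 receive identical routing because they share a gate, while task 2 is routed through a different gate and therefore a different expert subset. That mirrors the intent of sharing gates among tasks with similar action patterns, though the real encoder presumably learns these mappings.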

Paper Structure

This paper contains 21 sections, 16 equations, 15 figures, 16 tables, 1 algorithm.

Figures (15)

  • Figure 1: Motivation Illustration. Observations captured from fixed cameras often miss parts of the target objects. For example, the front view captures only the cupboard (highlighted with red circles), while the left and right shoulder views show only the sugar (already grasped by the end-effector and highlighted with green circles). These incomplete observations may lead to failed operations. In contrast, our proposed TVVE is designed to dynamically explore and re-render informative viewpoints that maximize coverage of target-relevant information, thereby improving the reliability of manipulation outcomes.
  • Figure 2: Overview of our Task-aware Virtual View Exploration (TVVE) framework. The framework takes multiple RGB-D images from fixed viewpoints as input. It first converts them into point clouds and aggregates them into a global point cloud in the world coordinate system. The pipeline then splits into two branches. One branch (orange) performs Coarse Grounding to predict the approximate position of the end-effector; it then moves the center of the global point cloud to this predicted position and applies scaling and cropping to retain the important point cloud region. The other branch (green) passes the global point cloud through MVEP to predict the optimal camera parameters for the observation viewpoint. Using these parameters, a 2D image is rendered from the point cloud processed by the Coarse Grounding branch. This rendered image is fed into Fine Grounding to predict the final robot action, including the end-effector position, rotation, gripper status, and collision state.
  • Figure 3: Pipeline of the TaskMoE. Our proposed TaskMoE takes Task ID, Instruction, and Vision as inputs to guide expert selection for task-specific visual representation learning. To improve scalability and generalization, we design a compact gating mechanism with $N_G$ gates shared across $N_J$ tasks ($N_G < N_J$). This design allows tasks with similar action patterns (e.g., Task 1 and Task 2) to share the same gate, while assigning distinct gates to semantically diverse tasks (e.g., Task 3), thereby enabling effective feature specialization across a variety of manipulation tasks.
  • Figure 4: Visualization of our TVVE in the simulated RLBench environment and Diffusion Policy in the real-world environment. (a) In RLBench, we visualize the dynamic multi-view re-rendering results for the Close Jar and Insert Peg tasks, where EEF denotes the end-effector. Visualizations of additional tasks are provided in the supplementary material. (b) Sample of Insert Peg. (c) Sample of Pick Grape in the real world.
  • Figure 5: Real-World Environment Setup.
  • ...and 10 more figures
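
The geometric front end described in the Figure 2 caption (RGB-D images to a global world-frame point cloud, then recentering, scaling, and cropping around the coarse end-effector estimate) can be sketched as follows. This is a minimal stdlib-only sketch, not the authors' implementation: the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the 4x4 `cam_to_world` matrices, the cubic crop, and all function names are assumptions, and the MVEP view prediction and rendering steps are not covered.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    fx: float, fy: float, cx: float, cy: float,
    cam_to_world: Sequence[Sequence[float]],
) -> List[Point]:
    """Lift a depth image (rows of depths in metres) to world-frame points.

    cam_to_world is an assumed, known 4x4 homogeneous transform (row-major).
    """
    points: List[Point] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:  # skip invalid depth readings
                continue
            # Pinhole model: pixel (u, v) at depth z -> camera-frame point.
            pc = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            pw = tuple(
                sum(cam_to_world[r][k] * pc[k] for k in range(4))
                for r in range(3)
            )
            points.append(pw)
    return points


def merge_views(clouds: Sequence[List[Point]]) -> List[Point]:
    """Aggregate per-camera clouds into one global cloud (simple concatenation)."""
    return [p for cloud in clouds for p in cloud]


def recenter_and_crop(
    points: Sequence[Point], center: Point, scale: float, half_extent: float
) -> List[Point]:
    """Move `center` to the origin, scale, and keep points inside a cube."""
    kept: List[Point] = []
    for p in points:
        q = tuple((p[i] - center[i]) * scale for i in range(3))
        if all(abs(c) <= half_extent for c in q):
            kept.append(q)
    return kept


# Toy example: one 2x2 depth image seen by a camera placed at the world origin.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
depth = [[1.0, 0.0], [2.0, 1.0]]
cloud = merge_views([backproject(depth, 1.0, 1.0, 0.0, 0.0, identity)])
# Pretend the coarse stage predicted the end-effector near (0, 0, 1).
local = recenter_and_crop(cloud, center=(0.0, 0.0, 1.0), scale=2.0, half_extent=2.0)
```

With several cameras, each depth image would be back-projected with its own extrinsics before `merge_views`; the cropped, recentered cloud is what a later stage would render from the predicted virtual viewpoint.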