Learning to See and Act: Task-Aware Virtual View Exploration for Robotic Manipulation
Yongjie Bai, Zhouxia Wang, Yang Liu, Kaijun Luo, Yifan Wen, Mingtong Dai, Weixing Chen, Ziliang Chen, Lingbo Liu, Guanbin Li, Liang Lin
TL;DR
This work tackles robustness and generalization in multi-task robotic manipulation by addressing two core bottlenecks: limited 3D perception from fixed viewpoints and task interference from shared visual encoders. It introduces TVVE, a framework that couples a Multi-Viewpoint Exploration Policy (MVEP) for dynamic, informative viewpoint rendering with a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to learn task-specific representations. The training protocol combines a fixed-view pretraining stage, offline PPO-based MVEP refinement via a pseudo-environment, and a final joint fine-tuning stage, enabling efficient learning without extensive real-world interaction. Empirical results on RLBench, the RLBench-OG benchmark, and real-robot experiments demonstrate substantial gains in accuracy and robustness, including under occlusions and visual perturbations, validating the effectiveness of task-aware viewpoint exploration and modular perception for generalization. The work also provides comprehensive ablations and analyses of routing behavior, generalization to unseen tasks, and practical considerations for real-world deployment.
Abstract
Recent vision-language-action (VLA) models for multi-task robotic manipulation commonly rely on static viewpoints and shared visual encoders, which limit 3D perception and cause task interference, hindering robustness and generalization. In this work, we propose Task-aware Virtual View Exploration (TVVE), a framework designed to overcome these challenges by integrating virtual view exploration with task-specific representation learning. TVVE employs an efficient exploration policy, accelerated by a novel pseudo-environment, to acquire informative views. Furthermore, we introduce a Task-aware Mixture-of-Experts (TaskMoE) visual encoder to disentangle features across different tasks, improving both representation fidelity and task generalization. By learning to see the world in a task-aware way, TVVE generates more complete and discriminative visual representations, which significantly improves action prediction across a wide range of manipulation challenges. To further validate the robustness and generalization capability of TVVE under out-of-distribution (OOD) settings, we construct a challenging benchmark, RLBench-OG, covering various visual perturbations and camera pose variations. Extensive experiments on RLBench and RLBench-OG show that TVVE outperforms state-of-the-art approaches. In real-robot experiments, TVVE achieves strong performance and generalizes robustly in multiple OOD settings, including visual disturbances and unseen instructions. Visual results and code are provided at: https://hcplab-sysu.github.io/TAVP.
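To make the idea of task-aware expert routing concrete, the following is a minimal sketch of a mixture-of-experts layer whose gate is conditioned on a task embedding. It is not the TaskMoE implementation: the abstract does not specify the routing mechanism, so the top-k softmax gate, the function names (`route`, `task_moe`), the toy scaling experts, and the identity gate matrix are all assumptions chosen for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(task_embedding, gate_weights, top_k=2):
    """Pick the top_k experts for a task (hypothetical gating scheme).

    Returns a list of (expert_index, weight) pairs whose weights sum to 1.
    """
    # One gating logit per expert: dot product of that expert's gate row
    # with the task embedding.
    logits = [sum(w * t for w, t in zip(row, task_embedding))
              for row in gate_weights]
    # Keep only the top_k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Renormalize the gate over the selected experts only.
    weights = softmax([logits[i] for i in top])
    return list(zip(top, weights))


def task_moe(features, task_embedding, gate_weights, experts, top_k=2):
    """Mix the outputs of the selected experts, weighted by the gate."""
    out = [0.0] * len(features)
    for idx, weight in route(task_embedding, gate_weights, top_k):
        expert_out = experts[idx](features)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy setup (assumed, for illustration only): four experts that each scale
# the input features by a different constant, and an identity gate matrix.
EXPERTS = [lambda f, s=s: [s * v for v in f] for s in (1.0, 2.0, 3.0, 4.0)]
GATE = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

if __name__ == "__main__":
    task = [0.0, 0.0, 1.0, 0.0]  # stand-in for a learned task embedding
    print(route(task, GATE, top_k=2))
    print(task_moe([1.0, 2.0], task, GATE, EXPERTS, top_k=1))
```

In this sketch, different task embeddings activate different experts, which is one simple way a shared encoder could reduce interference between tasks; a learned gate and neural experts would replace the toy components in practice.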
