SaPaVe: Towards Active Perception and Manipulation in Vision-Language-Action Models for Robotics

Mengzhen Liu, Enshen Zhou, Cheng Chi, Yi Han, Shanyu Rong, Liming Chen, Pengwei Wang, Zhongyuan Wang, Shanghang Zhang

Abstract

Active perception and manipulation are crucial for robots to interact with complex scenes. Existing methods struggle to unify semantic-driven active perception with robust, viewpoint-invariant execution. We propose SaPaVe, an end-to-end framework that jointly learns these capabilities in a data-efficient manner. Our approach decouples camera and manipulation actions rather than placing them in a shared action space, and follows a bottom-up training strategy: we first train semantic camera control on a large-scale dataset, then jointly optimize both action types using hybrid data. To support this framework, we introduce ActiveViewPose-200K, a dataset of 200k image-language-camera movement pairs for semantic camera movement learning, and a 3D geometry-aware module that improves execution robustness under dynamic viewpoints. We also present ActiveManip-Bench, the first benchmark for evaluating active manipulation beyond fixed-view settings. Extensive experiments in both simulation and real-world environments show that SaPaVe outperforms recent vision-language-action models such as GR00T N1 and \(π_0\), achieving up to 31.25% higher success rates in real-world tasks. These results show that tightly coupled perception and execution, when trained with decoupled yet coordinated strategies, enable efficient and generalizable active manipulation. Project page: https://lmzpai.github.io/SaPaVe
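
The decoupled action space and the two-stage schedule described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the `DecoupledAction` container, the `training_loss` helper, the vector contents, and the use of mean squared error as the objective (the actual training objective may differ). The sketch only shows the idea that stage 1 supervises the camera output alone, while stage 2 supervises camera and manipulation outputs together.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class DecoupledAction:
    """Two separate outputs instead of one concatenated action vector (hypothetical)."""
    camera: Sequence[float]        # hypothetical camera movement command
    manipulation: Sequence[float]  # hypothetical manipulation command


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("length mismatch")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(stage: int,
                  pred: DecoupledAction,
                  camera_target: Sequence[float],
                  manip_target: Optional[Sequence[float]] = None) -> float:
    """Stage 1 supervises the camera output only; stage 2 supervises both outputs.

    MSE is an assumed stand-in loss, chosen only to keep the sketch simple.
    """
    camera_loss = mse(pred.camera, camera_target)
    if stage == 1:
        return camera_loss
    if stage == 2:
        if manip_target is None:
            # A camera-only sample inside the hybrid data has no manipulation label.
            return camera_loss
        return camera_loss + mse(pred.manipulation, manip_target)
    raise ValueError("stage must be 1 or 2")


pred = DecoupledAction(camera=[1.0, 0.0], manipulation=[0.0, 2.0])
stage1 = training_loss(1, pred, camera_target=[0.0, 0.0])
stage2 = training_loss(2, pred, camera_target=[0.0, 0.0], manip_target=[0.0, 0.0])
```

Because the two outputs are kept in separate fields, a camera-only sample can be used in either stage without padding or masking a shared action vector, which is one plausible reading of why decoupling helps when mixing data sources.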

Paper Structure (48 sections, 6 equations, 15 figures, 6 tables)


Figures (15)

  • Figure 1: We propose SaPaVe, an end-to-end active manipulation framework that jointly integrates semantic active perception and active-view execution: the former selectively shifts viewpoints to reveal task-critical cues in cluttered scenes, while the latter grounds newly acquired observations into immediate actions, enabling success even from suboptimal views. (a) For instance, grasping the white bowl in $\mathcal{O}_1$ requires rotating the egocentric view, as both the fixed ego view and the third-person view are occluded. In contrast, targeting the range hood handle ($\mathcal{O}_5$) only needs a brief upward shift, since a precisely centered view is unnecessary and awkward to reach. (b) To address the limitations of fixed-view benchmarks and the cost of real-world trials, we introduce ActiveManip-Bench, a richly annotated benchmark spanning 12 tasks, 100 objects, and 20 diverse scenes. (c) On this benchmark, SaPaVe outperforms all baselines with an average success rate of 75.2%.
  • Figure 2: Overview of SaPaVe. SaPaVe takes RGB images and task instructions as input and outputs camera movements and manipulation actions in a decoupled action space. This decoupled design enables the model to achieve active manipulation via a bottom-up, two-stage training strategy. First, large-scale embodiment-agnostic camera control data fosters semantic active perception, which is encoded as prior knowledge in a camera adapter. Second, training on mixed data with Universal Spatial Knowledge Injection flexibly incorporates various geometric configurations (e.g., absolute depth, camera intrinsics), thereby enhancing spatial precision for active-view execution.
  • Figure 3: Overview of ActiveViewPose-200K. It is a high-quality dataset comprising 200k image-language and camera movement pairs, enriched with highly detailed semantic annotations to enable semantic camera movement learning.
  • Figure 4: Overview of ActiveManip-Bench. It is the first simulation benchmark to evaluate active manipulation beyond traditional fixed-view settings. ActiveManip-Bench features 12 richly annotated tasks across 100 objects and 20 diverse scenes.
  • Figure 5: Real-world execution rollouts (ego and third-person views).
  • ...and 10 more figures