Viewpoint-Agnostic Manipulation Policies with Strategic Vantage Selection

Sreevishakh Vasudevan, Som Sagar, Ransalu Senanayake

TL;DR

The paper tackles the brittleness of vision-guided manipulation policies to camera viewpoint changes. It introduces Vantage, a viewpoint-selection framework that uses Bayesian optimization with a Gaussian-process surrogate to pick a small set of informative training viewpoints for fine-tuning. The method comes with sublinear-regret and robustness guarantees and shows large empirical gains across simulated and real-world tasks and across policy families, including diffusion policies. Real-robot experiments confirm sim-to-real viability and demonstrate substantial performance improvements with a limited fine-tuning budget.

Abstract

Since vision-based manipulation policies are typically trained from data gathered from a single viewpoint, their performance drops when the view changes during deployment. Naively aggregating demonstrations from numerous random views is not only costly but also known to destabilize learning, as excessive visual diversity acts as noise. We present Vantage, a viewpoint selection framework to fine-tune any pre-trained policy on a small, strategically selected set of camera poses to induce viewpoint-agnostic behavior. Instead of relying on costly brute-force search over viewpoints, Vantage formulates camera placement as an information gain optimization problem in a continuous space. This approach balances exploration of novel poses with exploitation of promising ones, while also providing theoretical guarantees about convergence and robustness. Across manipulation tasks and policy families, Vantage consistently improves success under viewpoint shifts compared to fixed, grid, or random data selection strategies with only a handful of fine-tuning steps. Experiments conducted on simulated and real-world setups show that Vantage increases the task success rate by 25% for diffusion policies, and yields robust gains in dynamic-camera settings.

Paper Structure

This paper contains 19 sections, 6 theorems, 28 equations, 8 figures, 2 tables, 1 algorithm.

Key Result

Theorem III.1

Let $f:\Theta \to [0,1]$ denote the mapping from training viewpoints to average success rates, drawn from a GP prior with kernel $k$. Assume observations $y_t = f(\theta_t)+\varepsilon_t$ with $\varepsilon_t$ sub-Gaussian. Running $q$-UCB for $T$ rounds yields cumulative regret with probability at least $1-\delta$, where $\delta$ is a user chosen failure probability, and $\theta^*=\arg\max_{\thet
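The selection rule named in the theorem can be illustrated with a minimal sketch, which is not the authors' implementation. It makes several assumptions: it uses sequential UCB (batch size $q=1$) rather than batched $q$-UCB, a NumPy-only zero-mean GP with an RBF kernel, a fixed exploration weight `beta`, a two-dimensional normalized viewpoint space, and a hypothetical `toy_success_rate` function standing in for the expensive step of fine-tuning a policy and measuring its success rate.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between the rows of a and b."""
    sq_dist = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP surrogate."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solved = np.linalg.solve(k_oo, k_oq)          # K^{-1} k(x_obs, x_query)
    mean = solved.T @ y_obs
    var = 1.0 - np.sum(k_oq * solved, axis=0)     # RBF prior variance is 1
    return mean, np.sqrt(np.clip(var, 0.0, None))


def toy_success_rate(theta):
    """Hypothetical stand-in for f(theta): success rate in [0, 1]."""
    peak = np.array([0.7, 0.3])                   # assumed best viewpoint
    return float(np.exp(-4.0 * np.sum((theta - peak) ** 2)))


def select_viewpoints(n_rounds=15, beta=2.0, seed=0):
    """Pick viewpoints one at a time by maximizing mean + beta * std."""
    rng = np.random.default_rng(seed)
    # Candidate viewpoints, e.g. normalized (azimuth, elevation) pairs.
    candidates = rng.uniform(0.0, 1.0, size=(400, 2))
    x_obs = candidates[:1].copy()                 # initial (default) viewpoint
    y_obs = np.array([toy_success_rate(x_obs[0])])
    for _ in range(n_rounds):
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        ucb = mean + beta * std                   # exploitation + exploration
        theta_next = candidates[np.argmax(ucb)]
        x_obs = np.vstack([x_obs, theta_next])
        y_obs = np.append(y_obs, toy_success_rate(theta_next))
    return x_obs, y_obs


if __name__ == "__main__":
    viewpoints, successes = select_viewpoints()
    best = int(np.argmax(successes))
    print("best viewpoint found:", viewpoints[best])
    print("success rate there:", successes[best])
```

The `beta * std` term is what drives exploration of unvisited viewpoints, while the posterior mean favors regions already observed to perform well. In a real setting, each call to the stand-in function would correspond to one costly fine-tune-and-evaluate cycle, which is why the number of rounds is kept small.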

Figures (8)

  • Figure 1: (a) In contemporary and future robotic platforms, cameras are frequently mounted on moving bodies or joints, causing dynamic changes in viewpoints. (b) Illustration of selecting camera viewpoints when training manipulation policies so that they become agnostic to camera location at test time. Randomly selecting viewpoints deteriorates performance. Consider a setup where a camera can be placed anywhere in the $x\!-\!y\!-\!z$ space to observe the manipulator and its surroundings. Colored regions indicate where the camera can be placed for each policy for good performance. Pre-trained policies work only when a camera is placed in the narrow region (blue) of high accuracy, where it was originally trained. Standard fine-tuning (yellow), which relies on samples (i.e., demonstrations) collected from randomly or uniformly placed camera viewpoints, spreads the demonstration-collection and fine-tuning budget across many uninformative regions. Samples with occlusion or errors in depth perception (gray) can even hinder learning performance. Vantage, in contrast, strategically selects a small number of informative viewpoints (green), targeting areas that maximize downstream task performance. This allows Vantage-fine-tuned policies to perform well even in dynamic camera settings, described in (a) and in the introduction.
  • Figure 2: Overview of the Vantage framework. Starting from a pre-trained manipulation policy and an initial camera viewpoint, the system iteratively selects additional viewpoints for fine-tuning. After each Bayesian optimization step (BO step), the updated policy is evaluated across the observation space, and the performance signal is used to guide the next selection. By progressively incorporating strategically chosen views, the policy becomes increasingly robust to viewpoint shifts, converging to a final model that generalizes across diverse views.
  • Figure 3: Success rate across observation space (i.e., the cradle of a hemisphere where the camera can be moved at deployment). The top row shows performance under the default pretraining viewpoint versus a novel test viewpoint. The bottom row compares the same hemisphere after Vantage fine-tuning. While the pre-trained policy works only near where it was trained, the Vantage-fine-tuned model works almost everywhere.
  • Figure 4: Experiment setups for Vantage: (First) external camera placement with Unitree D1 arm, (second, third, fourth) RoboSuite environments for benchmark tasks.
  • Figure 5: Vantage iterations for viewpoint selection on Pick & Place. Each point corresponds to a candidate camera position evaluated during fine-tuning, with color and size indicating success rate and iteration, respectively. The purple square marks the default training viewpoint, while the yellow star denotes the globally best-performing vantage point. The search progressively concentrates around informative regions, converging to the optimum by iteration 7.
  • ...and 3 more figures

Theorems & Definitions (12)

  • Theorem III.1: Efficiency of viewpoint selection
  • proof : Proof sketch
  • Theorem III.2: Success rate convergence
  • proof : Proof sketch
  • Theorem III.3: Robustness under camera placement error
  • proof : Proof sketch
  • Theorem 1.1: GP-UCB Regret Bound
  • proof
  • Theorem 1.2: Average Success Convergence
  • proof
  • ...and 2 more