View-Invariant Policy Learning via Zero-Shot Novel View Synthesis

Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine Liu, Sergey Zakharov, Vitor Guizilini, Jiajun Wu

TL;DR

The paper addresses the failure of visuomotor policies to generalize to novel camera viewpoints by introducing View Synthesis Augmentation (VISTA), which uses single-image, diffusion-based novel view synthesis to generate alternative viewpoints during training. By replacing frames in offline demonstrations with synthesized views from potentially unseen perspectives, the method trains policies that are robust to viewpoint changes without requiring depth data or camera calibration, and without altering inference. The key finding is that ZeroNVS, especially when finetuned on robotic data (MimicGen or DROID), improves robustness to out-of-distribution viewpoints in both simulated and real-world manipulation tasks, and that wrist-camera observations offer complementary gains. The approach offers a scalable path to viewpoint-robust visuomotor learning from single-viewpoint demonstrations by leveraging the large-scale visual priors learned by diffusion models.

Abstract

Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
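The augmentation scheme described above (replace an observation with a synthesized view of the same scene, keep the action label unchanged) can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the `Step` type, `sample_relative_pose`, the pose ranges, the `p_augment` probability, and the `identity_synthesizer` placeholder are all assumptions introduced here. In practice `synthesize_view` would wrap a single-image novel view synthesis model such as ZeroNVS.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in types for illustration only.
Image = List[List[int]]            # placeholder for an RGB frame
Pose = Tuple[float, float, float]  # placeholder relative camera change: (yaw, pitch, radius offset)


@dataclass
class Step:
    image: Image
    action: Sequence[float]


def sample_relative_pose(rng: random.Random,
                         max_yaw_deg: float = 15.0,
                         max_pitch_deg: float = 15.0,
                         max_radius: float = 0.1) -> Pose:
    """Sample a random camera offset. The ranges are assumed, not from the paper."""
    return (rng.uniform(-max_yaw_deg, max_yaw_deg),
            rng.uniform(-max_pitch_deg, max_pitch_deg),
            rng.uniform(-max_radius, max_radius))


def augment_trajectory(traj: Sequence[Step],
                       synthesize_view: Callable[[Image, Pose], Image],
                       rng: random.Random,
                       p_augment: float = 0.5) -> List[Step]:
    """Replace some frames with synthesized views; action labels are held constant."""
    out: List[Step] = []
    for step in traj:
        if rng.random() < p_augment:
            pose = sample_relative_pose(rng)
            new_image = synthesize_view(step.image, pose)
            out.append(Step(image=new_image, action=step.action))
        else:
            out.append(step)
    return out


def identity_synthesizer(image: Image, pose: Pose) -> Image:
    """Placeholder for a real view synthesis model: returns a copy of the input frame."""
    return [row[:] for row in image]
```

Because only the training observations change, a policy trained on the augmented trajectories is deployed exactly as before, which matches the summary's point that inference is unaltered.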

Paper Structure (29 sections, 12 figures, 5 tables, 1 algorithm)

This paper contains 29 sections, 12 figures, 5 tables, 1 algorithm.

Figures (12)

  • Figure 1: We aim to learn policies that generalize to novel viewpoints from widely available, offline single-view RGB robotic trajectory data.
  • Figure 2: Random samples from the two considered evaluation viewpoint ranges.
  • Figure 3: Depiction of the data augmentation scheme that we study. Observations are replaced with viewpoint-augmented versions of the same scene with action labels held constant.
  • Figure 4: Qualitative examples of novel views rendered on robotic tasks. All images are synthesized zero-shot; that is, models have not been previously trained on data from that task. We observe that finetuning on robotic datasets improves image fidelity, particularly for robot appearances.
  • Figure 5: Performance of novel-view-augmented policies when provided with additional wrist camera observations, which are consistent between train and test settings. As expected, wrist observations improve performance across the board, since they are agnostic to the third-person viewpoint. These improvements complement those of view augmentation methods.
  • ...and 7 more figures