View-Invariant Policy Learning via Zero-Shot Novel View Synthesis
Stephen Tian, Blake Wulfe, Kyle Sargent, Katherine Liu, Sergey Zakharov, Vitor Guizilini, Jiajun Wu
TL;DR
The paper addresses the generalization gap of visuomotor policies to novel camera viewpoints by introducing View Synthesis Augmentation (VISTA), which uses a single-image, diffusion-based novel view synthesis model to generate alternative viewpoints during training. Frames in offline, single-viewpoint demonstrations are replaced with synthesized views from potentially unseen perspectives, so the resulting policies are more robust to viewpoint changes. The method requires no depth data or camera calibration and leaves inference unchanged. The key finding is that ZeroNVS, especially when finetuned on robotic data (MimicGen or DROID), improves robustness to out-of-distribution viewpoints in both simulated and real-world manipulation tasks, with wrist-camera observations offering complementary gains. By leveraging large-scale visual priors learned by diffusion models, the approach offers a scalable, data-efficient path to viewpoint-robust visuomotor learning.
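The augmentation step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `RelativePose`, `sample_relative_pose`, `augment_demo`, and `stub_synthesizer` are hypothetical, the pose parameterization and the per-frame replacement probability are assumptions, and the stub merely shifts pixels where a real system would call a view synthesis model such as ZeroNVS.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

Image = TypeVar("Image")
Action = TypeVar("Action")


@dataclass(frozen=True)
class RelativePose:
    """Hypothetical camera offset relative to the original view (radians)."""
    yaw: float
    pitch: float


def sample_relative_pose(rng: random.Random, max_angle: float = 0.5) -> RelativePose:
    """Sample a random viewpoint offset; the sampling range is an assumption."""
    return RelativePose(
        yaw=rng.uniform(-max_angle, max_angle),
        pitch=rng.uniform(-max_angle, max_angle),
    )


def augment_demo(
    frames: Sequence[Image],
    actions: Sequence[Action],
    synthesize: Callable[[Image, RelativePose], Image],
    rng: random.Random,
    p_replace: float = 0.5,
) -> List[Tuple[Image, Action]]:
    """Replace some frames with synthesized novel views; actions are untouched."""
    if len(frames) != len(actions):
        raise ValueError("frames and actions must have the same length")
    augmented = []
    for frame, action in zip(frames, actions):
        if rng.random() < p_replace:
            # Only the observation changes, so the policy sees the same
            # action label from a different (synthesized) viewpoint.
            frame = synthesize(frame, sample_relative_pose(rng))
        augmented.append((frame, action))
    return augmented


def stub_synthesizer(image: List[list], pose: RelativePose) -> List[list]:
    """Stand-in for a real model: cyclically shifts each row by a yaw-derived amount."""
    if not image or not image[0]:
        return image
    shift = round(pose.yaw * 10) % len(image[0])
    return [row[-shift:] + row[:-shift] for row in image]
```

Because the augmentation only rewrites training observations, the policy architecture and the inference loop stay as they were; a call such as `augment_demo(frames, actions, stub_synthesizer, random.Random(0))` would sit in the data-loading path of an otherwise unchanged imitation-learning pipeline.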
Abstract
Large-scale visuomotor policy learning is a promising approach toward developing generalizable manipulation systems. Yet, policies that can be deployed on diverse embodiments, environments, and observational modalities remain elusive. In this work, we investigate how knowledge from large-scale visual data of the world may be used to address one axis of variation for generalizable manipulation: observational viewpoint. Specifically, we study single-image novel view synthesis models, which learn 3D-aware scene-level priors by rendering images of the same scene from alternate camera viewpoints given a single input image. For practical application to diverse robotic data, these models must operate zero-shot, performing view synthesis on unseen tasks and environments. We empirically analyze view synthesis models within a simple data-augmentation scheme that we call View Synthesis Augmentation (VISTA) to understand their capabilities for learning viewpoint-invariant policies from single-viewpoint demonstration data. Upon evaluating the robustness of policies trained with our method to out-of-distribution camera viewpoints, we find that they outperform baselines in both simulated and real-world manipulation tasks. Videos and additional visualizations are available at https://s-tian.github.io/projects/vista.
