
Efficient Camera Pose Augmentation for View Generalization in Robotic Policy Learning

Sen Wang, Huaiyi Dong, Jingyi Tian, Jiayi Li, Zhuo Yang, Tongtong Cao, Anlin Chen, Shuang Wu, Le Wang, Sanping Zhou

Abstract

Prevailing 2D-centric visuomotor policies exhibit a pronounced deficiency in novel view generalization, as their reliance on static observations hinders consistent action mapping across unseen views. In response, we introduce GenSplat, a feed-forward 3D Gaussian Splatting framework that facilitates view-generalized policy learning through novel view rendering. GenSplat employs a permutation-equivariant architecture to reconstruct high-fidelity 3D scenes from sparse, uncalibrated inputs in a single forward pass. To ensure structural integrity, we design a 3D-prior distillation strategy that regularizes the 3DGS optimization, preventing the geometric collapse typical of purely photometric supervision. By rendering diverse synthetic views from these stable 3D representations, we systematically augment the observational manifold during training. This augmentation forces the policy to ground its decisions in underlying 3D structures, thereby ensuring robust execution under severe spatial perturbations where baselines severely degrade.
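The core augmentation idea — rendering novel views from poses jittered around the source camera — can be sketched as a small random SE(3) perturbation of the camera-to-world matrix. The helper below is a hypothetical illustration (the function name, parameter ranges, and sampling scheme are assumptions, not the paper's implementation); the perturbed pose would then be passed to the 3DGS renderer.

```python
import numpy as np

def perturb_pose(c2w, max_rot_deg=10.0, max_trans=0.05, rng=None):
    """Apply a small random SE(3) perturbation to a 4x4 camera-to-world pose.

    Hypothetical sketch of camera pose augmentation: novel views are
    rendered from poses jittered around the original camera.
    """
    rng = np.random.default_rng(rng)
    # Sample a random rotation axis and a bounded rotation angle.
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    # Rodrigues' formula: build the rotation from the axis-angle pair.
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)
    # Small random translation offset (in the same units as the scene).
    t = rng.uniform(-max_trans, max_trans, size=3)
    delta = np.eye(4)
    delta[:3, :3] = R
    delta[:3, 3] = t
    # Compose the perturbation with the original pose.
    return delta @ c2w
```

Sweeping `max_rot_deg` and `max_trans` over increasing ranges would correspond to the graded perturbation levels evaluated in the experiments.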


Paper Structure

This paper contains 14 sections, 8 equations, 9 figures, and 3 tables.

Figures (9)

  • Figure 1: The diagram of the proposed GenSplat. Given expert demonstrations, GenSplat employs a feed-forward 3D reconstruction pipeline to render geometrically consistent novel viewpoints under controlled perturbations, explicitly boosting camera pose diversity. Experiments demonstrate that policies trained on GenSplat-augmented data achieve competitive performance across different perturbation levels, substantially outperforming those trained solely on human-collected demonstrations.
  • Figure 2: Overview of GenSplat. Our feed-forward 3DGS framework reconstructs a 3D scene from sparse, uncalibrated robotic observations. GenSplat employs a permutation-equivariant transformer architecture to predict camera poses, dense point maps, and Gaussian parameters, while leveraging pre-trained visual geometry models for 3D-prior distillation supervision. The reconstructed scenes enable geometrically consistent novel view synthesis to expand the observational manifold, significantly improving the viewpoint generalization of robotic policies.
  • Figure 3: Overview of the experiment setup and tasks. We design six manipulation tasks for real-world evaluation.
  • Figure 4: Policy robustness to camera pose perturbations. We evaluate 30 episodes per task on 6 real-world tasks under increasing camera perturbations and report mean success rates (in %). Policies trained with GenSplat-augmented novel-view data consistently outperform those trained only on source views for both $\pi_{0}$ and Diffusion Policy.
  • Figure 5: Qualitative 3D geometry reconstruction on the DROID dataset. We visualize pairs of reference targets (ground truth or pseudo-labels) against GenSplat's predicted RGB, depth, and normal maps. GenSplat accurately recovers structural high-frequency details and maintains strict spatial alignment without explicit camera calibration. Extended visual comparisons are provided in Appendix B.
  • ...and 4 more figures