Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning
Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, Matthew R. Walter
TL;DR
The paper tackles view-invariant imitation learning for robotic manipulation by conditioning RGB-based policies on explicit camera extrinsics, encoded as per-pixel Plücker ray maps. It presents a practical conditioning framework that accommodates both pretrained and non-pretrained encoders, and introduces six benchmarks across RoboSuite and ManiSkill to probe viewpoint generalization. Across ACT, Diffusion Policy, and SmolVLA, camera conditioning yields consistent improvements in both simulated and real-world settings and reduces reliance on static background cues. The work provides actionable benchmarks and ablation insights on action spaces and encoding strategies, and it discusses limitations due to pose estimation errors as well as avenues for future generalization across camera intrinsics.
Abstract
We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in RoboSuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes; this shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code at https://ripl.github.io/know_your_camera/.
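To make the idea of a per-pixel Plücker ray map concrete, the following is a minimal NumPy sketch, not the authors' released implementation. It assumes a pinhole camera with intrinsics matrix `K`, a 4x4 camera-to-world pose `cam_to_world`, and rays cast through pixel centers; the function name `plucker_ray_map` and these conventions are illustrative choices. Each pixel receives the six-dimensional Plücker coordinates (d, o × d), where d is the unit ray direction in the world frame and o is the camera center.

```python
import numpy as np

def plucker_ray_map(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d) per pixel.

    Assumptions (illustrative, not taken from the paper): pinhole intrinsics K,
    a 4x4 camera-to-world pose, and rays cast through pixel centers.
    """
    # Pixel centers in homogeneous image coordinates, shape (H, W, 3).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)

    # Back-project to camera-frame directions, then rotate into the world frame.
    dirs_cam = pixels @ np.linalg.inv(K).T
    rotation = cam_to_world[:3, :3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Moment m = o x d, with o the camera center expressed in the world frame.
    origin = cam_to_world[:3, 3]
    moment = np.cross(origin, dirs_world)

    return np.concatenate([dirs_world, moment], axis=-1)


# Hypothetical camera: 64x48 image, focal length 100 px, translated from the origin.
K = np.array([[100.0, 0.0, 32.5],
              [0.0, 100.0, 24.5],
              [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[:3, 3] = [1.0, 2.0, 3.0]

ray_map = plucker_ray_map(K, cam_to_world, height=48, width=64)
print(ray_map.shape)  # (48, 64, 6)
```

Because the map has the same spatial layout as the image, one plausible way to use it is to stack it with the RGB channels or feed it through a parallel encoder branch; which option suits a given policy depends on whether the image encoder is pretrained.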
