Do You Know Where Your Camera Is? View-Invariant Policy Learning with Camera Conditioning

Tianchong Jiang, Jingtian Ji, Xiangshan Tan, Jiading Fang, Anand Bhattad, Vitor Guizilini, Matthew R. Walter

TL;DR

The paper tackles view-invariant imitation learning for robotic manipulation by conditioning RGB-based policies on explicit camera extrinsics using per-pixel Plücker ray maps. It presents a practical conditioning framework that accommodates both pretrained and non-pretrained encoders, and introduces six benchmarks across RoboSuite and ManiSkill to probe viewpoint generalization. Across ACT, Diffusion Policy, and SmolVLA, camera conditioning yields consistent improvements in both simulated and real-world settings, reducing reliance on static background cues. The work provides actionable benchmarks, ablation insights on action spaces and encoding strategies, and discusses limitations due to pose estimation errors and avenues for future cross-camera intrinsics generalization.

Abstract

We study view-invariant imitation learning by explicitly conditioning policies on camera extrinsics. Using Plücker embeddings of per-pixel rays, we show that conditioning on extrinsics significantly improves generalization across viewpoints for standard behavior cloning policies, including ACT, Diffusion Policy, and SmolVLA. To evaluate policy robustness under realistic viewpoint shifts, we introduce six manipulation tasks in RoboSuite and ManiSkill that pair "fixed" and "randomized" scene variants, decoupling background cues from camera pose. Our analysis reveals that policies without extrinsics often infer camera pose using visual cues from static backgrounds in fixed scenes; this shortcut collapses when workspace geometry or camera placement shifts. Conditioning on extrinsics restores performance and yields robust RGB-only control without depth. We release the tasks, demonstrations, and code at https://ripl.github.io/know_your_camera/.
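
To make the per-pixel Plücker embedding concrete, here is a minimal NumPy sketch of how such a ray map could be built from a pinhole camera. It is an illustration under stated assumptions, not the authors' released code: the function name `plucker_ray_map`, the (direction, moment) channel ordering, the use of pixel centers, and the camera-to-world pose convention are all choices made for this example.

```python
import numpy as np


def plucker_ray_map(K, cam_to_world, height, width):
    """Return a (6, H, W) array holding (d, o x d) for every pixel.

    K            : (3, 3) pinhole intrinsics (assumed, no distortion).
    cam_to_world : (4, 4) camera pose, i.e. the inverse of the extrinsics
                   if those are given as world-to-camera (assumed convention).
    """
    # Pixel centers; u runs along the width, v along the height.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the intrinsics: ray directions in the camera frame.
    dirs_cam = pixels @ np.linalg.inv(K).T

    # Rotate the directions into the world frame and normalize them.
    rotation = cam_to_world[:3, :3]
    origin = cam_to_world[:3, 3]
    dirs_world = dirs_cam @ rotation.T
    dirs_world /= np.linalg.norm(dirs_world, axis=-1, keepdims=True)

    # Plücker moment m = o x d; it is shared structure that encodes
    # where the camera sits, not only where each pixel looks.
    moment = np.cross(origin, dirs_world)

    rays = np.concatenate([dirs_world, moment], axis=-1)  # (H, W, 6)
    return rays.transpose(2, 0, 1)  # channels first: (6, H, W)


# Hypothetical 5x5 camera placed one unit along the world x-axis.
K_demo = np.array([[100.0, 0.0, 2.5], [0.0, 100.0, 2.5], [0.0, 0.0, 1.0]])
pose_demo = np.eye(4)
pose_demo[:3, 3] = [1.0, 0.0, 0.0]
ray_map_demo = plucker_ray_map(K_demo, pose_demo, height=5, width=5)
```

Because the moment depends on the camera origin, two cameras that see similar pixels from different positions produce different ray maps, which is the information a policy would otherwise have to guess from background cues.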

Paper Structure

This paper contains 23 sections, 1 equation, 13 figures, 1 table.

Figures (13)

  • Figure 1: Visualization of camera poses in the real-robot experiment. Training cameras are visualized in green, and test cameras are visualized in red.
  • Figure 2: We propose two ways by which to encode Plücker ray-maps for policies with and without a pretrained encoder. The $\bigoplus$ sign indicates channel-wise concatenation.
  • Figure 3: Six custom tasks. The top row is the fixed setups and the bottom row is the randomized setups. The left three are in RoboSuite, and the right three are in ManiSkill. Each sub-figure overlays three images with different initialization seeds to illustrate variations in the environments.
  • Figure 4: Visualization of changes of two factors in data collection: camera pose and initial state of environment. On the left, $n = 3$ and $m = 1$. On the right, $n = 4$ and $m = 2$.
  • Figure 5: Visualization of camera poses. The training camera poses are in green and the test camera poses are in red.
  • ...and 8 more figures
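
The channel-wise concatenation mentioned in the Figure 2 caption can be sketched as follows. This is a hedged illustration only: the helper name `concat_ray_map`, the channels-first layout, and the choice to stack the six Plücker channels directly onto the RGB input are assumptions for this example, and the caption does not say at which stage of either architecture the concatenation happens.

```python
import numpy as np


def concat_ray_map(rgb, ray_map):
    """Stack a (6, H, W) ray map onto a (3, H, W) image along the channel axis."""
    if rgb.shape[1:] != ray_map.shape[1:]:
        raise ValueError("image and ray map must share the same height and width")
    # Channel-wise concatenation: the result has 3 + 6 = 9 channels.
    return np.concatenate([rgb, ray_map], axis=0)


# Placeholder inputs (hypothetical sizes, zero-filled ray map).
rgb_demo = np.random.rand(3, 4, 5)
rays_demo = np.zeros((6, 4, 5))
stacked_demo = concat_ray_map(rgb_demo, rays_demo)
```

An encoder consuming this input would need its first layer to accept nine channels instead of three, which is straightforward when the encoder is trained from scratch.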