Perception Stitching: Zero-Shot Perception Encoder Transfer for Visuomotor Robot Policies

Pingcheng Jian, Easop Lee, Zachary Bell, Michael M. Zavlanos, Boyuan Chen

TL;DR

The paper proposes Perception Stitching (PeS), a modular approach for zero-shot transfer of visuomotor policies across different visual configurations by reusing perception encoders. It introduces latent-space alignment via relative representations anchored to exemplar images and enforces disentanglement to stabilize cross-encoder transfer, achieving strong zero-shot performance in both simulation and real-world robot manipulation tasks. Key contributions include a practical two-encoder policy decomposition, anchor-based latent alignment, and comprehensive analyses (latent-space visuals and Grad-CAM) to elucidate why perceptual modularity improves transfer. The work enables plug-and-play reuse of perception modules, reducing data collection requirements for new camera setups and facilitating robust real-world deployment of visuomotor policies across diverse sensing configurations.

Abstract

Vision-based imitation learning has shown promising capabilities in endowing robots with various motion skills given visual observations. However, current visuomotor policies fail to adapt to drastic changes in their visual observations. We present Perception Stitching, which enables strong zero-shot adaptation to large visual changes by directly stitching novel combinations of visual encoders. Our key idea is to enforce modularity of visual encoders by aligning the latent visual features among different visuomotor policies. Our method disentangles the perceptual knowledge from the downstream motion skills and allows the reuse of the visual encoders by directly stitching them to a policy network trained with partially different visual conditions. We evaluate our method in various simulated and real-world manipulation tasks. While baseline methods failed at all attempts, our method achieved zero-shot success in real-world visuomotor tasks. Our quantitative and qualitative analysis of the learned features of the policy network provides more insights into the high performance of our proposed method.
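
The alignment idea can be illustrated with a small numerical sketch. It is not the authors' implementation: it assumes cosine similarity to the anchor latents as the relative representation (the text here does not state the similarity function), and it stands in for two encoders with random vectors related by a random orthogonal map. The names `relative_representation`, `latents_a`, and `anchors_a` are hypothetical. Under these assumptions, the relative representations of the two "encoders" coincide even though their raw latents differ.

```python
import numpy as np

def relative_representation(latents, anchor_latents, eps=1e-8):
    """Cosine similarity of every latent vector to every anchor latent."""
    z = latents / (np.linalg.norm(latents, axis=1, keepdims=True) + eps)
    a = anchor_latents / (np.linalg.norm(anchor_latents, axis=1, keepdims=True) + eps)
    return z @ a.T  # shape: (num_samples, num_anchors)

rng = np.random.default_rng(0)

# Stand-ins for encoder A outputs on observations and on anchor images (assumed data).
latents_a = rng.normal(size=(5, 16))
anchors_a = rng.normal(size=(8, 16))

# Hypothetical encoder B: the same features up to an orthogonal map (an isometry).
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
latents_b = latents_a @ q
anchors_b = anchors_a @ q

rel_a = relative_representation(latents_a, anchors_a)
rel_b = relative_representation(latents_b, anchors_b)

# Raw latents differ, but the anchor-relative views agree up to float error.
max_gap = float(np.max(np.abs(rel_a - rel_b)))
print("max |rel_a - rel_b| =", max_gap)
```

Because a downstream policy network in this sketch would only see the anchor-relative features, an encoder whose latent space differs by such a map could in principle be swapped in without retraining. Real encoders are only approximately isometric to each other, so the agreement would be approximate rather than exact.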

Paper Structure

This paper contains 22 sections, 12 equations, 11 figures, 14 tables, and 1 algorithm.

Figures (11)

  • Figure 1: Perception Stitching: "Policy A" was trained with an in-hand camera and a front-view camera. "Policy B" was trained with a close-up camera and a side-view camera. Perception Stitching enables zero-shot stitching of the original Policy A and B by reusing their relevant components for each sensing configuration to form a "Policy C". "Policy C" can maintain strong zero-shot transfer performance with an in-hand camera and a side-view camera.
  • Figure 2: Method Overview. Two visual encoders process the RGB images from two cameras separately, and the latent representations are concatenated with the proprioception of the robot end effector state. The original latent representations of the images are observed to have an approximate isometric transformation relationship. Relative representations with disentanglement regularization can maintain an approximate invariance and, therefore, help achieve high zero-shot transfer performance.
  • Figure 3: Anchor Selection. The anchor images in one dataset are selected with the k-means algorithm (Hartigan and Wong, 1979). The trajectories of the first dataset are then replayed to collect a second dataset with a different camera. The images at the same indices as the anchors in the first dataset are used as the anchors in the second dataset.
  • Figure 4: Simulation Experiment Setup. (a) Five simulation tasks from the Robomimic benchmark (Mandlekar et al., 2021). (b) Seven camera configuration variations. (c) Four camera mounting positions.
  • Figure 5: Perception Stitching with Single Camera. The original two policies are trained with only one camera. The other visual encoder takes in a black image. The corresponding anchors are selected with our proposed method (Figure 3). The visual encoder of the black image uses the same anchor images as the other encoder of this policy.
  • ...and 6 more figures
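
The anchor selection described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: it clusters flattened per-frame feature vectors (the caption does not say which features are clustered), uses a plain Lloyd-style k-means written in NumPy, and takes the frame nearest each centroid as an anchor. The helper names `kmeans` and `select_anchor_indices` and the random stand-in data are hypothetical. The key point from the caption is the last step: the second dataset reuses the same frame indices, which is valid because it was collected by replaying the same trajectories.

```python
import numpy as np

def kmeans(x, k, iters=50, seed=0):
    """Plain Lloyd-style k-means; returns the k centroids."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def select_anchor_indices(features, k, seed=0):
    """Indices of the frames closest to each k-means centroid."""
    centroids = kmeans(features, k, seed=seed)
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    # np.unique drops duplicates, so fewer than k indices may be returned.
    return np.unique(dists.argmin(axis=0))

rng = np.random.default_rng(1)

# Stand-in features for the same 200 replayed frames seen by two cameras (assumed data).
frames_cam1 = rng.normal(size=(200, 32))
frames_cam2 = rng.normal(size=(200, 32))

# Select anchors in the first dataset only, then reuse the indices in the second.
anchor_idx = select_anchor_indices(frames_cam1, k=10)
anchors_cam1 = frames_cam1[anchor_idx]
anchors_cam2 = frames_cam2[anchor_idx]
print("anchor indices:", anchor_idx)
```

Sharing indices rather than clustering each dataset separately keeps the anchors paired: anchor i in both datasets shows the same scene state from a different viewpoint.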