Grounding Bodily Awareness in Visual Representations for Efficient Policy Learning

Junlin Wang, Zhiyun Lin

TL;DR

Problem: learning effective visual representations for robotic manipulation where body dynamics are critical. Approach: ICon applies an inter-token contrastive objective, $\mathcal{L}_{\text{ICon}}$, to ViT token features to separate agent-centric from environment cues, using farthest-point sampling (FPS) to select diverse keys and a multi-level design that weights layers with $\gamma$. This objective is combined with the diffusion-policy prediction loss, controlled by $\lambda$, to enable end-to-end training. Contributions: agent/environment disentanglement across ViT layers, FPS-based diverse key sampling, and demonstrated improvements in policy performance and cross-robot transfer across RLBench and Robosuite. Significance: yields more data-efficient visuomotor learning and practical cross-robot adaptation in manipulation tasks.
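
For orientation, the combined objective described above can be written schematically as follows. The symbols $\mathcal{L}_{\text{total}}$ and $\mathcal{L}_{\text{diff}}$ (the diffusion-policy prediction loss), the layer index $l$, the set $S$ of supervised ViT layers, and the per-layer weights $\gamma_l$ are notational assumptions made here. The summary only states that $\gamma$ weights layers and $\lambda$ controls the auxiliary term, so the paper's exact form may differ.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{diff}} + \lambda\, \mathcal{L}_{\text{ICon}},
\qquad
\mathcal{L}_{\text{ICon}} = \sum_{l \in S} \gamma_l\, \mathcal{L}_{\text{ICon}}^{(l)}
$$

Under this reading, setting $\lambda = 0$ reduces training to the plain diffusion-policy objective, and a single-element $S$ corresponds to supervising only one layer.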

Abstract

Learning effective visual representations for robotic manipulation remains a fundamental challenge due to the complex body dynamics involved in action execution. In this paper, we study how visual representations that carry body-relevant cues can enable efficient policy learning for downstream robotic manipulation tasks. We present $\textbf{I}$nter-token $\textbf{Con}$trast ($\textbf{ICon}$), a contrastive learning method applied to the token-level representations of Vision Transformers (ViTs). ICon enforces a separation in the feature space between agent-specific and environment-specific tokens, resulting in agent-centric visual representations that embed body-specific inductive biases. This framework can be seamlessly integrated into end-to-end policy learning by incorporating the contrastive loss as an auxiliary objective. Our experiments show that ICon not only improves policy performance across various manipulation tasks but also facilitates policy transfer across different robots. The project website: https://github.com/HenryWJL/icon
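
To make the idea of separating agent and environment tokens concrete, here is a minimal NumPy sketch of an InfoNCE-style inter-token loss. It is an assumption-laden illustration, not the paper's definition of $\mathcal{L}_{\text{ICon}}$: the function name `inter_token_contrast`, the `temperature` parameter, the use of cosine similarity, and the choice of agent tokens as anchors are all hypothetical.

```python
import numpy as np


def inter_token_contrast(agent_keys, env_keys, temperature=0.1):
    """InfoNCE-style loss (hypothetical form, not taken from the paper).

    agent_keys: (Na, D) features of sampled agent tokens, Na >= 2.
    env_keys:   (Ne, D) features of sampled environment tokens, Ne >= 1.
    Each agent token is an anchor; other agent tokens are positives and
    all environment tokens are negatives.
    """
    # L2-normalize so dot products become cosine similarities.
    a = agent_keys / np.linalg.norm(agent_keys, axis=1, keepdims=True)
    e = env_keys / np.linalg.norm(env_keys, axis=1, keepdims=True)

    pos = a @ a.T / temperature  # (Na, Na) agent-agent logits
    neg = a @ e.T / temperature  # (Na, Ne) agent-environment logits

    # Numerically stable log-sum-exp over each anchor's negatives.
    m = neg.max(axis=1, keepdims=True)
    neg_lse = m + np.log(np.exp(neg - m).sum(axis=1, keepdims=True))  # (Na, 1)

    # Per (anchor, positive) pair: -log(exp(pos) / (exp(pos) + sum exp(neg))).
    pair_loss = np.logaddexp(pos, neg_lse) - pos

    # Drop the diagonal, where a token would be paired with itself.
    off_diag = ~np.eye(len(a), dtype=bool)
    return float(pair_loss[off_diag].mean())
```

The loss approaches zero when agent tokens are mutually similar and dissimilar to environment tokens, and grows when the two groups overlap in feature space. In an end-to-end setup this scalar would be added to the policy loss as the auxiliary term.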

Paper Structure

This paper contains 28 sections, 6 equations, 6 figures, 5 tables, 2 algorithms.

Figures (6)

  • Figure 1: Overview of ICon. A full-scene RGB image containing a robotic agent is tokenized and processed by a vision transformer. The resulting token-level features (excluding the [CLS] token) are reshaped and aligned with a token-level mask derived from the agent’s segmentation mask. Tokens corresponding to the agent and the environment are then sampled and used as keys to compute the inter-token contrastive loss.
  • Figure 2: Visualization of point distributions sampled from the agent mask. (a) Random sampling may result in points clustered within a small region. (b) Farthest Point Sampling (FPS) produces points that are well-distributed across the entire agent.
  • Figure 3: Visualization of simulated environments used for evaluation.
  • Figure 4: Comparison of training stability based on maximum and average performance during the training process.
  • Figure 5: Summary of ablation experiments.
  • ...and 1 more figure
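
The captions for Figures 1 and 2 describe two preprocessing steps: deriving a token-level mask from the agent's segmentation mask, and picking well-spread agent tokens with Farthest Point Sampling. The sketch below illustrates both in NumPy. The function names, the `patch_size` and `threshold` parameters, and the rule of marking a patch as agent when at least half of its pixels are agent pixels are assumptions for illustration; the source does not specify how the pooling is done.

```python
import numpy as np


def token_mask_from_segmentation(seg_mask, patch_size, threshold=0.5):
    """Pool a binary (H, W) agent mask to a token-level (H/p, W/p) mask.

    A patch counts as an agent token when the fraction of agent pixels
    in it reaches `threshold` (an assumed rule, not from the paper).
    """
    h, w = seg_mask.shape
    gh, gw = h // patch_size, w // patch_size
    cropped = seg_mask[: gh * patch_size, : gw * patch_size]
    patches = cropped.reshape(gh, patch_size, gw, patch_size)
    coverage = patches.mean(axis=(1, 3))  # agent-pixel fraction per patch
    return coverage >= threshold


def farthest_point_sampling(points, num_samples, start=0):
    """Greedy FPS: repeatedly pick the point farthest from those chosen."""
    points = np.asarray(points, dtype=float)
    if num_samples <= 0 or len(points) == 0:
        return []
    num_samples = min(num_samples, len(points))
    chosen = [start]
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(num_samples - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        dist_to_new = np.linalg.norm(points - points[nxt], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen
```

A typical use would be to call `np.argwhere` on the token mask to get the grid coordinates of agent tokens, run `farthest_point_sampling` on those coordinates, and index the ViT token features with the returned positions. Because each new pick maximizes the distance to the points already chosen, the samples spread over the whole agent rather than clustering in one region, which is the contrast with random sampling that Figure 2 shows.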