Contrastive Representation Regularization for Vision-Language-Action Models

Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

TL;DR

This paper tackles the misalignment between pre-trained Vision-Language Model representations and robotic signals in Vision-Language-Action systems by introducing Robot State-aware Contrastive Loss (RS-CL). RS-CL uses a learnable summarization token and a proprioception-informed weighted contrastive objective, complemented by a view-cutoff representation augmentation, and is integrated alongside the standard flow-matching loss to form a lightweight, end-to-end regularization. Empirically, RS-CL yields consistent gains across multiple benchmarks (RoboCasa-Kitchen, LIBERO) and real-world tasks, with notable improvements on precision-critical pick-and-place actions and robust performance across from-scratch and fine-tuning setups. The results demonstrate that incorporating robot-centric signals into representation learning can substantially enhance control-relevant features in VLA models, suggesting a path toward more reliable and versatile robotic manipulation. Future work could extend RS-CL to additional proprioceptive modalities and more complex robotic platforms.

Abstract

Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models; it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen, through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
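To make the idea of "relative distances between the states as soft supervision" concrete, the following is a minimal NumPy sketch of one plausible soft-weighted contrastive loss. It is not the paper's exact formulation: the function name, the cosine similarity, the softmax weighting over negative state distances, and the temperatures `tau` and `beta` are all assumptions made for illustration.

```python
import numpy as np


def _log_softmax(x):
    """Row-wise log-softmax that tolerates -inf entries (masked pairs)."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))


def state_weighted_contrastive_loss(z, q, tau=0.1, beta=1.0):
    """Illustrative soft-weighted contrastive loss (hypothetical form).

    z:    (N, D) embeddings, e.g. one summary vector per sample.
    q:    (N, S) proprioceptive states of the same N samples.
    tau:  temperature on embedding similarities (assumed value).
    beta: temperature on state distances (assumed value).
    Requires N >= 2.
    """
    z = np.asarray(z, dtype=np.float64)
    q = np.asarray(q, dtype=np.float64)
    n = z.shape[0]
    off_diag = ~np.eye(n, dtype=bool)  # self-pairs are excluded everywhere

    # Cosine-similarity logits between all pairs of embeddings.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = np.where(off_diag, z @ z.T / tau, -np.inf)

    # Pairwise Euclidean distances between proprioceptive states.
    dist = np.linalg.norm(q[:, None, :] - q[None, :, :], axis=-1)

    # Soft targets: pairs with closer states get larger weight; rows sum to 1.
    weights = np.exp(_log_softmax(np.where(off_diag, -dist / beta, -np.inf)))

    # Cross-entropy between the soft targets and the predicted pair distribution.
    log_prob = np.where(off_diag, _log_softmax(logits), 0.0)
    return float(-(weights * log_prob).sum(axis=1).mean())


# Toy batch: samples 0/1 share a similar state, as do samples 2/3.
q = np.array([[0.0], [0.1], [5.0], [5.1]])
z_aligned = np.array([[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]])
z_shuffled = z_aligned[[0, 2, 1, 3]]  # embeddings no longer follow the states
print(state_weighted_contrastive_loss(z_aligned, q))
print(state_weighted_contrastive_loss(z_shuffled, q))
```

In this sketch the loss is lower when embeddings of samples with similar states are close together. In a training loop, a term of this kind would be added to the action prediction loss with a weighting coefficient; that coefficient and its value are not specified here.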

Paper Structure

This paper contains 27 sections, 5 equations, 12 figures, 12 tables, 1 algorithm.

Figures (12)

  • Figure 1: Overview. We extend the standard VLA training framework with a contrastive path. Embeddings from the pre-trained VLM are augmented by the view cutoff operation applied on a randomly selected observation view, and optimized with our Robot State-aware Contrastive Loss to attract samples with similar proprioceptive states, complementing the action prediction loss.
  • Figure 2: Training VLM representations for action prediction. (a) We visualize VLM embeddings of robot episodes performing the same task "Open the microwave / cabinet door" across different scenes in RoboCasa-Kitchen. (b) Pre-trained VLM representations are dominated by visual appearance (e.g., distractor objects). (c) RS-CL guides embeddings to align with the robot's proprioceptive states, yielding representations that capture common robotic signals (e.g., the robot's current pose, next control action) across environments, thereby aligning all episodes by task progress.
  • Figure 3: Representation-level augmentation for contrastive pairs. View cutoff is a simple augmentation that randomly masks out the embedding slice of one observation view from the VLM representation.
  • Figure 4: Examples of tasks used in our experiments. We study RS-CL on the multitask simulation benchmarks (a) RoboCasa-Kitchen and (b) LIBERO. In addition, we consider (c) real-robot manipulation tasks, comprising pick-and-place tasks and a close-lid task, using two camera viewpoints.
  • Figure 5: Real-robot task success rate (%). Results on (a) in-domain tasks (4 pick-and-place and 1 close-lid task), and (b) generalization tasks (visual, physical generalization, and language grounding). For the in-domain close-lid and language grounding tasks, we report both partial success (e.g., successful pickup, language following; transparent bars) and full success (solid bars).
  • ...and 7 more figures
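
The view cutoff augmentation described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `view_cutoff`, the `(start, stop)` token layout per view, and the choice to mask by zeroing (rather than, say, dropping tokens) are all hypothetical.

```python
import numpy as np


def view_cutoff(tokens, view_slices, rng):
    """Mask the embedding slice of one randomly chosen observation view.

    tokens:      (T, D) token embeddings from the VLM for one sample.
    view_slices: list of (start, stop) token ranges, one per camera view
                 (hypothetical layout; real models may order tokens differently).
    rng:         a numpy.random.Generator used to pick the view.
    Returns the augmented copy and the index of the masked view.
    """
    out = np.array(tokens, dtype=np.float64, copy=True)  # leave the input intact
    view_idx = int(rng.integers(len(view_slices)))
    start, stop = view_slices[view_idx]
    out[start:stop] = 0.0  # assumption: masking is done by zeroing the slice
    return out, view_idx


# Toy sample: 6 tokens of width 4, with two views of 3 tokens each.
rng = np.random.default_rng(0)
tokens = np.ones((6, 4))
augmented, masked_view = view_cutoff(tokens, [(0, 3), (3, 6)], rng)
print(masked_view, augmented.sum(axis=1))
```

Applying the function twice to the same sample with independent random draws gives two differently masked embeddings that could serve as a contrastive pair.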