Contrastive Representation Regularization for Vision-Language-Action Models

Taeyoung Kim; Jimin Lee; Myungkyu Koo; Dongyoung Kim; Kyungmin Lee; Changyeon Kim; Younggyo Seo; Jinwoo Shin

TL;DR

This paper tackles the misalignment between pre-trained Vision-Language Model representations and robotic signals in Vision-Language-Action systems by introducing Robot State-aware Contrastive Loss (RS-CL). RS-CL uses a learnable summarization token and a proprioception-informed weighted contrastive objective, complemented by a view-cutoff representation augmentation, and is integrated alongside the standard flow-matching loss to form a lightweight, end-to-end regularization. Empirically, RS-CL yields consistent gains across multiple benchmarks (RoboCasa-Kitchen, LIBERO) and real-world tasks, with notable improvements on precision-critical pick-and-place actions and robust performance across from-scratch and fine-tuning setups. The results demonstrate that incorporating robot-centric signals into representation learning can substantially enhance control-relevant features in VLA models, suggesting a path toward more reliable and versatile robotic manipulation. Future work could extend RS-CL to additional proprioceptive modalities and more complex robotic platforms.

Abstract

Vision-Language-Action (VLA) models have shown their capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive states. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL effectively enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the manipulation performance of state-of-the-art VLA models; it pushes the prior art from 30.8% to 41.5% on pick-and-place tasks in RoboCasa-Kitchen, through more accurate positioning during grasping and placing, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
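
To make the idea of "relative distances between the states as soft supervision" concrete, here is a minimal NumPy sketch of a state-weighted contrastive loss. It is an illustration under stated assumptions, not the paper's implementation: the softmax form of the weights, the temperature `tau`, the distance scale `beta`, and both function names are hypothetical choices that the abstract does not specify.

```python
import numpy as np


def _masked_log_softmax(logits):
    """Row-wise log-softmax over off-diagonal entries (self-pairs excluded)."""
    n = logits.shape[0]
    masked = np.where(np.eye(n, dtype=bool), -np.inf, logits)
    shifted = masked - masked.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))


def state_weighted_contrastive_loss(embeddings, states, tau=0.1, beta=1.0):
    """Soft contrastive loss weighted by proprioceptive-state proximity.

    embeddings: (N, D) array, one representation per sample (N >= 2).
    states:     (N, P) array of proprioceptive states for the same samples.
    tau, beta:  assumed temperature and state-distance scale (hypothetical).
    """
    # Cosine-similarity logits between all pairs of representations.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    log_p = _masked_log_softmax(z @ z.T / tau)

    # Soft targets: pairs with closer robot states receive larger weight.
    dist = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    weights = np.exp(_masked_log_softmax(-dist / beta))

    # Cross-entropy between the soft targets and the similarity distribution,
    # averaged over anchors; the diagonal is skipped to avoid 0 * (-inf).
    off_diag = ~np.eye(len(z), dtype=bool)
    return float(-(weights[off_diag] * log_p[off_diag]).sum() / len(z))
```

In a training loop, a term like this would be added to the action prediction objective with a weighting coefficient. The loss is small when samples with similar robot states also have similar representations, and grows when the two disagree.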
