Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

Yang Zhang; Chenwei Wang; Ouyang Lu; Yuan Zhao; Yunfei Ge; Zhenglong Sun; Xiu Li; Chi Zhang; Chenjia Bai; Xuelong Li

TL;DR

This work addresses the challenge of adapting Vision-Language-Action (VLA) models to new robot embodiments and tasks under limited data. It introduces Align-Then-stEer (ATE), a two-stage approach. The first stage builds a unified action latent space by training two Info-VAEs and embedding adaptation actions into modes of the pre-training latent distribution via a reverse KL constraint. The second stage applies classifier-based guidance in that latent space to steer diffusion- or flow-based VLAs during fine-tuning. The framework is plug-and-play: it leaves VLA architectures unchanged while delivering substantial gains in cross-embodiment and cross-task manipulation on both simulated benchmarks (RoboTwin 1.0, ManiSkill3) and real dual-arm robots (RealMan), along with faster convergence and improved robustness to perturbations. By constraining adaptation to a structured latent manifold and guiding outputs toward the target distribution, ATE preserves valuable visuomotor priors while enabling rapid deployment to new platforms and tasks, a practical step toward scalable, generalist robot policies.

Abstract

Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
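The choice of reverse KL in the alignment stage can be motivated by its mode-seeking behavior: minimizing KL(q || p) tends to place q on a single mode of a multimodal p, whereas forward KL(p || q) spreads q across all modes. The following is a minimal 1D sketch of that property only, not the paper's Info-VAE or its training objective. The bimodal density, the grid ranges, and the names `log_normal`, `kl`, and `fit_gaussian` are illustrative assumptions.

```python
import numpy as np

# Toy stand-in for a multimodal "pre-training" latent density p (an assumption):
# an equal mixture of N(-3, 0.5^2) and N(3, 0.5^2), evaluated on a fixed grid.
X = np.linspace(-12.0, 12.0, 2401)
DX = X[1] - X[0]


def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2); log space avoids underflow in the tails."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2.0 * np.pi)


LOG_P = np.logaddexp(log_normal(X, -3.0, 0.5), log_normal(X, 3.0, 0.5)) + np.log(0.5)


def kl(log_a, log_b):
    """Riemann-sum estimate of KL(a || b) on the grid X."""
    return float(np.sum(np.exp(log_a) * (log_a - log_b)) * DX)


def fit_gaussian(direction):
    """Grid-search the single Gaussian q that minimizes the chosen KL direction."""
    best_loss, best_mu, best_sigma = np.inf, None, None
    for mu in np.linspace(-4.0, 4.0, 41):
        for sigma in np.linspace(0.25, 4.0, 16):
            log_q = log_normal(X, mu, sigma)
            if direction == "reverse":
                loss = kl(log_q, LOG_P)   # KL(q || p): mode-seeking
            else:
                loss = kl(LOG_P, log_q)   # KL(p || q): mass-covering
            if loss < best_loss:
                best_loss, best_mu, best_sigma = loss, float(mu), float(sigma)
    return best_mu, best_sigma


mu_rev, sigma_rev = fit_gaussian("reverse")
mu_fwd, sigma_fwd = fit_gaussian("forward")
print(f"reverse KL fit: mu={mu_rev:.2f}, sigma={sigma_rev:.2f}")
print(f"forward KL fit: mu={mu_fwd:.2f}, sigma={sigma_fwd:.2f}")
```

In this toy setting the reverse-KL fit lands on one of the two modes with a narrow width, while the forward-KL fit centers between them with a wide variance. This matches the intuition of embedding adaptation actions into modes of the pre-training latent distribution rather than averaging over them.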
