Align-Then-stEer: Adapting the Vision-Language Action Models through Unified Latent Guidance

Yang Zhang, Chenwei Wang, Ouyang Lu, Yuan Zhao, Yunfei Ge, Zhenglong Sun, Xiu Li, Chi Zhang, Chenjia Bai, Xuelong Li

TL;DR

This work addresses the challenge of adapting Vision-Language-Action models to new robot embodiments and tasks under limited data. It introduces Align-Then-stEer (ATE), a two-stage approach that first creates a unified action latent space by training two Info-VAEs and embedding adaptation actions into modes of the pre-training latent distribution via reverse KL, then employs classifier-guided latent guidance to steer diffusion- or flow-based VLAs during fine-tuning. The framework is plug-and-play, preserving VLA architectures while delivering substantial gains in cross-embodiment and cross-task manipulation on both simulated benchmarks (RoboTwin 1.0, ManiSkill3) and real dual-arm robots (RealMan), including improved convergence speed and robustness to perturbations. By constraining adaptation within a structured latent manifold and guiding outputs toward the target distribution, ATE preserves valuable visuomotor priors while enabling rapid deployment to new platforms and tasks, representing a practical step toward scalable, generalist robot policies.
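The mode-seeking behavior that the reverse KL term relies on can be illustrated with a toy one-dimensional example. The sketch below is not the paper's Info-VAE objective: the bimodal density `p`, the fixed width `sigma_q`, and the grid search over candidate means are all assumptions made for illustration. It fits a single Gaussian `q` to a two-mode target and compares the minimizer of the reverse KL, KL(q || p), with that of the forward KL, KL(p || q).

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    # Density of N(mu, sigma^2) evaluated on the grid x.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl(p, q, dx):
    # Riemann-sum approximation of KL(p || q) on a uniform grid.
    eps = 1e-300  # guards against log(0) in the far tails
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))) * dx)

x = np.linspace(-12.0, 12.0, 4801)
dx = x[1] - x[0]

# Hypothetical stand-in for a pre-training latent density: two separated modes.
p = 0.5 * gaussian_pdf(x, -3.0, 0.5) + 0.5 * gaussian_pdf(x, 3.0, 0.5)

# Fit q = N(m, sigma_q^2) by brute-force search over the mean m.
candidate_means = np.linspace(-5.0, 5.0, 201)
sigma_q = 0.5
reverse = [kl(gaussian_pdf(x, m, sigma_q), p, dx) for m in candidate_means]
forward = [kl(p, gaussian_pdf(x, m, sigma_q), dx) for m in candidate_means]

mu_reverse = candidate_means[int(np.argmin(reverse))]  # mode-seeking fit
mu_forward = candidate_means[int(np.argmin(forward))]  # mass-covering fit
print(mu_forward, mu_reverse)
```

In this toy setting the forward-KL fit lands between the two modes (near 0), while the reverse-KL fit settles on one of the modes (near -3 or +3). This is the property the TL;DR refers to when it says adaptation actions are embedded into modes of the pre-training latent distribution.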

Abstract

Vision-Language-Action (VLA) models pre-trained on large, diverse datasets show remarkable potential for general-purpose robotic manipulation. However, a primary bottleneck remains in adapting these models to downstream tasks, especially when the robot's embodiment or the task itself differs from the pre-training data. This discrepancy leads to a significant mismatch in action distributions, demanding extensive data and compute for effective fine-tuning. To address this challenge, we introduce Align-Then-stEer (ATE), a novel, data-efficient, and plug-and-play adaptation framework. ATE first aligns disparate action spaces by constructing a unified latent space, where a variational autoencoder constrained by reverse KL divergence embeds adaptation actions into modes of the pre-training action latent distribution. Subsequently, it steers the diffusion- or flow-based VLA's generation process during fine-tuning via a guidance mechanism that pushes the model's output distribution towards the target domain. We conduct extensive experiments on cross-embodiment and cross-task manipulation in both simulation and the real world. Compared to direct fine-tuning of representative VLAs, our method improves the average multi-task success rate by up to 9.8% in simulation and achieves a striking 32% success rate gain in a real-world cross-embodiment setting. Our work presents a general and lightweight solution that greatly enhances the practicality of deploying VLA models to new robotic platforms and tasks.
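The steering stage can be pictured with the generic classifier-guidance idea: add the gradient of a classifier's log-likelihood to the sampler's score so that samples drift toward the target domain. The abstract does not give the exact guidance formula, so the following is only a minimal sketch under stated assumptions. The base policy is a toy N(0, 1) density, the "classifier" is a hypothetical Gaussian likelihood centered at `target`, and unadjusted Langevin sampling stands in for a diffusion or flow sampler. None of the names below come from the paper.

```python
import numpy as np

def prior_score(x):
    # Score (gradient of the log-density) of the toy base policy N(0, 1).
    return -x

def classifier_grad(x, target=2.0, scale=0.5):
    # Gradient of log p(y | x) for a hypothetical Gaussian "classifier"
    # that prefers samples near `target` (an assumption, not the paper's model).
    return -(x - target) / scale**2

def guided_langevin(weight, n_chains=5000, n_steps=500, step=0.01, seed=0):
    # Unadjusted Langevin dynamics on the guided score:
    #   score(x) = prior_score(x) + weight * classifier_grad(x)
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_chains)
    for _ in range(n_steps):
        score = prior_score(x) + weight * classifier_grad(x)
        noise = rng.standard_normal(n_chains)
        x = x + step * score + np.sqrt(2.0 * step) * noise
    return x

unguided = guided_langevin(weight=0.0)  # samples from the base density
guided = guided_langevin(weight=1.0)    # samples pushed toward the target
print(unguided.mean(), guided.mean())
```

With `weight=0.0` the samples stay centered near 0. With `weight=1.0` they concentrate near 1.6, the mean of the product of the two Gaussians in this toy setup. The `weight` parameter plays the role of a guidance scale: larger values trade fidelity to the base policy for a stronger pull toward the target distribution.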

Paper Structure

This paper contains 33 sections, 16 equations, 14 figures, 11 tables, 3 algorithms.

Figures (14)

  • Figure 1: We present ATE, a plug-and-play adaptation framework for pre-trained Vision-Language-Action (VLA) models. Unlike prior methods that directly fine-tune VLAs, ATE aligns disparate action spaces into a unified latent representation and steers the VLA's generation via guidance, enabling data-efficient cross-task and cross-embodiment adaptation. This framework is evaluated in simulation on RoboTwin and ManiSkill benchmarks, as well as on a real-world dual-arm RealMan 7-DoF robot, demonstrating strong generalization, bimanual dexterous coordination, and minute-level long-horizon manipulation, achieving substantial gains in multi-task success rates.
  • Figure 2: Overview of the ATE framework. (a) In the first stage, we construct a unified action space to bridge the embodiment gap between the pretraining and adaptation stages by utilizing the mode-seeking behavior of asymmetric VAEs. (b) In the second stage, we integrate classifier guidance in diffusion- and flow-based VLAs to steer the pretrained policy towards the target action distribution of a specific robot platform.
  • Figure 3: Evaluation of DP baseline with and without ATE across (a) Out-of-distribution tasks and (b) In-distribution tasks. Incorporating ATE consistently improves success rates and accelerates convergence, with the most significant gains observed on challenging tasks where baseline performance is low.
  • Figure 4: Real-world evaluation on physical robot experiments. Top panel reports success rates across four tasks and overall average, comparing ATE with the baseline π0. Bottom panel shows full execution trajectories of four representative tasks, covering long-horizon, single-arm, and dual-arm scenarios.
  • Figure 5: Real-world generalization settings. For the Spatial Generalization experiment, we vary the initial positions of manipulated objects. For the Visual Distractors experiment, we randomly place unseen items (e.g., fruits) in the scene. For the Human Disturbance experiment, we intervene at key execution points by resetting already completed sub-tasks.
  • ...and 9 more figures