Unifying Perception and Action: A Hybrid-Modality Pipeline with Implicit Visual Chain-of-Thought for Robotic Action Generation

Xiangkai Ma, Lekai Xing, Han Zhang, Wenzhong Li, Sanglu Lu

TL;DR

VITA introduces Vision-Integrated Trajectory Alignment, a hybrid-modality pipeline that unifies perception and action through a shared discrete latent space and an implicit visual chain-of-thought. By coupling cross-modal vector quantization with a two-stage VLM backbone (textual CoT and visual CoT) and a warmup/Co-train training regimen, the model learns motion priors from diverse video data while directly associating visual dynamics with motor commands. Empirically, VITA achieves state-of-the-art or competitive performance on the CALVIN, LIBERO, and SimplerEnv benchmarks and reaches an average real-world success rate of 80.5% across six tasks, with robust in-distribution (ID) and out-of-distribution (OOD) generalization. The approach reduces inference latency and demonstrates strong data efficiency, supporting its potential as a generalist robotic manipulation model for real-world deployment.

Abstract

Vision-Language-Action (VLA) models built upon Chain-of-Thought (CoT) have achieved remarkable success in advancing general-purpose robotic agents, owing to their strong perceptual comprehension. Because text-only CoT struggles to capture scene details in complex spatial environments, a promising recent strategy leverages visual priors to guide robotic action generation. Nevertheless, such strategies face two inherent challenges: (i) a modality gap between visual observations and low-level actions, and (ii) unstable training due to competing objectives between visual prediction and action generation. To address these challenges, we propose a Vision-Integrated Trajectory Alignment (VITA) framework that learns a shared discrete latent space for vision and action, enabling joint modeling of perception and motor control. VITA introduces an implicit visual CoT: autoregressively generated tokens are simultaneously decoded into future-frame predictions and robot actions, thereby internalizing visual dynamics as an inductive bias for motion planning. Extensive experiments in simulated and real-world environments demonstrate state-of-the-art performance. VITA improves by 14.5%, 9.6%, and 12.1% over existing baselines on CALVIN, LIBERO, and SimplerEnv, respectively. Furthermore, VITA attains an average success rate of 80.5% across six real-world tasks, demonstrating its potential as a generalist robotic manipulation model.
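The idea of a shared discrete latent space can be illustrated with a toy sketch. This is not the authors' implementation: the codebook, the two linear "decoders", all dimensions, and every function name below are hypothetical, and the weights are random rather than learned. It only shows the general pattern of quantizing vision and action latents against one codebook and decoding a single token into both a frame prediction and an action.

```python
import random

def quantize(vec, codebook):
    """Return the index of the codebook entry nearest to vec (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))

def linear(weights, vec):
    """Matrix-vector product; a stand-in for a learned decoder head."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def random_matrix(rows, cols):
    """Random weights for illustration only; a real model would learn these."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

random.seed(0)

# Hypothetical sizes chosen for readability, not taken from the paper.
LATENT_DIM, CODEBOOK_SIZE = 4, 8
FRAME_DIM, ACTION_DIM = 6, 3

# One codebook shared by both modalities.
codebook = random_matrix(CODEBOOK_SIZE, LATENT_DIM)
frame_decoder = random_matrix(FRAME_DIM, LATENT_DIM)
action_decoder = random_matrix(ACTION_DIM, LATENT_DIM)

def decode_token(token_id):
    """Decode one discrete token into a frame feature and an action vector."""
    code = codebook[token_id]
    return linear(frame_decoder, code), linear(action_decoder, code)

# Latents from (hypothetical) vision and action encoders map to the same
# token vocabulary because they are quantized against the same codebook.
vision_latent = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
action_latent = [random.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
vision_token = quantize(vision_latent, codebook)
action_token = quantize(action_latent, codebook)

frame_pred, action_pred = decode_token(vision_token)
```

In this sketch a single token index yields both outputs, which is the sense in which one generated token sequence can serve as a visual prediction and a motor command at the same time.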

Paper Structure

This paper contains 65 sections, 20 equations, 19 figures, 12 tables.

Figures (19)

  • Figure 1: Overview of the VITA framework. Utilizing the cross-modal alignment in ①, visual perception and motor control modalities are unified in the shared discrete latent space, where the dual-autoencoder architectures are illustrated in ② and ③. Benefiting from the representation alignment, the VLM backbone in ④ generates dynamics-unified tokens via a hybrid attention mechanism. These tokens are decoded into future frames and robot actions, as Internal CoT.
  • Figure 2: Visualization of the "contextual reasoning and color matching" in the real world.
  • Figure 3: Visualizations of VITA’s generated action trajectories in the CALVIN, SimplerEnv Google Robot, and WidowX simulations.
  • Figure 4: Demonstration of our real-world robotic platform.
  • Figure 5: (a) Based on the established real-world robotic platform, we designed multiple scenarios to collect training data for the UR-5e agent. (b) Furthermore, we also presented the benchmark in the simulation environment.
  • ...and 14 more figures