CEED-VLA: Consistency Vision-Language-Action Model with Early-Exit Decoding

Wenxuan Song, Jiayi Chen, Pengxiang Ding, Yuxin Huang, Han Zhao, Donglin Wang, Haoang Li

TL;DR

This work tackles the inference speed bottleneck in Vision-Language-Action models for robotics by introducing CEED-VLA, which combines consistency distillation with mixed-label supervision and an early-exit decoding strategy to accelerate Jacobi-based decoding. The approach trains a student VLA to map Jacobi trajectory states to fixed points, enabling multiple correct tokens per iteration and significantly reducing iterations while preserving task performance. Extensive simulations on CALVIN and LIBERO, plus real-world dual-arm experiments, show roughly 3–4x speedups and up to 4x higher action frequencies with comparable success rates, validating the method's practical impact for real-time robotic manipulation. The work offers a general acceleration framework for multimodal decision-making in robotics and highlights fixed-token phenomena as a core driver of speedups, with clear avenues for future improvements in data efficiency and convergence control.

Abstract

In recent years, Vision-Language-Action (VLA) models have become a vital research direction in robotics due to their impressive multimodal understanding and generalization capabilities. Despite this progress, their practical deployment is severely constrained by inference speed bottlenecks, particularly in high-frequency and dexterous manipulation tasks. While recent studies have explored Jacobi decoding as a more efficient alternative to traditional autoregressive decoding, its practical benefits are marginal because it still requires many iterations. To address this, we introduce consistency distillation training so that the model predicts multiple correct action tokens in each iteration, thereby achieving acceleration. In addition, we design mixed-label supervision to mitigate error accumulation during distillation. Although distillation brings an acceptable speedup, we identify that certain inefficient iterations remain a critical bottleneck. To tackle this, we propose an early-exit decoding strategy that moderately relaxes the convergence conditions, which further improves average inference efficiency. Experimental results show that the proposed method achieves more than 4 times inference acceleration across different baselines while maintaining high task success rates in both simulated and real-world robot tasks. These experiments validate that our approach provides an efficient and general paradigm for accelerating multimodal decision-making in robotics. Our project page is available at https://irpn-eai.github.io/CEED-VLA/.
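
For readers unfamiliar with Jacobi decoding, the iteration the abstract refers to can be written down as a fixed-point update. The notation below is an assumption made for illustration and is not taken from the paper: $x$ is the multimodal prompt, $n$ the number of action tokens, $y^{(j)}$ the token sequence after $j$ iterations, and $p_\theta$ the model's next-token distribution under greedy decoding.

```latex
% Jacobi update: every position is refreshed in parallel from the previous iterate
y_i^{(j+1)} = \arg\max_{y} \; p_\theta\!\left(y \,\middle|\, y_{<i}^{(j)},\, x\right),
\qquad i = 1, \dots, n

% Fixed point: the iterate stops changing (at most n updates are needed to reach it)
y^{*} = y^{(k)} \quad \text{where} \quad y^{(k)} = y^{(k-1)}
```

Under greedy decoding the fixed point $y^{*}$ coincides with the autoregressive output, since position $i$ is guaranteed correct once all earlier positions are. Read this way, the consistency distillation described above plausibly aims for a student whose update sends any intermediate state $y^{(j)}$ on the trajectory directly to $y^{*}$, so that several tokens become correct per iteration; the exact training objective is not stated in this summary.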

Paper Structure

This paper contains 35 sections, 6 equations, 10 figures, 8 tables, 2 algorithms.

Figures (10)

  • Figure 1: Acceleration effect of CEED-VLA on OpenVLA and LLaVA-VLA. Left: Comparison of the number of iterations required for a complete output. CEED-VLA substantially reduces the number of iterations, thus allowing faster decoding. Right: Comparison of the decoding speed. The speedup from directly running Jacobi decoding in a vanilla VLA is marginal. Our CEED-VLA achieves 3.6$\times$ and 2.0$\times$ speedups on OpenVLA and LLaVA-VLA, respectively, with negligible performance degradation. In scenarios targeting more aggressive acceleration, CEED-VLA-Turbo requires even fewer iterations and delivers a larger speedup while incurring only a slight degradation in performance.
  • Figure 2: Overview of our proposed CEED-VLA. Our proposed framework first runs the pretrained VLA (e.g., LLaVA-VLA) with Jacobi decoding to generate the training dataset. Then we design an effective consistency distillation process with novel mixed-label supervision to obtain the student model. Finally, we propose early-exit decoding to further unlock inference speed. Experiments in simulators and the real world show significant acceleration with comparable success rates.
  • Figure 3: L1 distance between the generated Jacobi trajectory dataset and the ground-truth data.
  • Figure 4: An instance of Jacobi trajectory with early-exit decoding. Gray numbers indicate incorrect tokens, while blue numbers denote correct ones. Blue numbers with underlines represent fixed tokens. The three rows from bottom to top illustrate the Jacobi trajectory, starting from the initialized point and ending at the exit point. The topmost row represents the Jacobi fixed point.
  • Figure 5: Speedup and average length of CEED-VLA decoding with different values of exit point (left) and trained with different data amounts (right). On the left, our CEED-VLA employs an exit point of 16, and the extremely accelerated version CEED-VLA-Turbo exits at 8.
  • ...and 5 more figures
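
The decoding loop that Figures 4 and 5 describe can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_next_token` is a hypothetical deterministic stand-in for a VLA's greedy next-token prediction, and `exit_point` is an assumed name for the iteration cap at which decoding stops even if the fixed point has not been reached (the Figure 5 caption reports exit points of 16 for CEED-VLA and 8 for CEED-VLA-Turbo).

```python
from typing import Callable, List, Optional, Sequence, Tuple

NextToken = Callable[[Sequence[int]], int]


def toy_next_token(prefix: Sequence[int]) -> int:
    """Hypothetical stand-in for a model's greedy next-token prediction."""
    return (sum(prefix) * 31 + len(prefix)) % 256


def autoregressive_decode(next_token: NextToken, prompt: Sequence[int], n: int) -> List[int]:
    """Reference decoder: one token per step, n sequential steps."""
    out: List[int] = []
    for _ in range(n):
        out.append(next_token(list(prompt) + out))
    return out


def jacobi_decode(
    next_token: NextToken,
    prompt: Sequence[int],
    n: int,
    exit_point: Optional[int] = None,  # assumed name: cap on the number of iterations
    init_token: int = 0,
) -> Tuple[List[int], int]:
    """Jacobi decoding with an optional early exit.

    Each iteration refreshes all n positions from the previous iterate.
    In a real model this is a single parallel forward pass; here it is a
    plain loop over positions.
    """
    current = [init_token] * n
    iterations = 0
    while True:
        updated = [next_token(list(prompt) + current[:i]) for i in range(n)]
        iterations += 1
        converged = updated == current  # fixed point: nothing changed
        current = updated
        if converged:
            break
        if exit_point is not None and iterations >= exit_point:
            break  # early exit: accept the current, possibly unconverged, state
    return current, iterations


if __name__ == "__main__":
    prompt = [3, 1, 4]
    reference = autoregressive_decode(toy_next_token, prompt, 8)
    full, full_iters = jacobi_decode(toy_next_token, prompt, 8)
    early, early_iters = jacobi_decode(toy_next_token, prompt, 8, exit_point=3)
    print("fixed point matches autoregressive output:", full == reference)
    print("iterations (full, early):", full_iters, early_iters)
```

With an untrained toy predictor, each iteration is only guaranteed to fix one more leading token, which mirrors the caption's remark that vanilla Jacobi decoding gives a marginal speedup. The benefit of an early exit therefore depends on a distilled model that gets most tokens right within a few iterations.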