AR-VLA: True Autoregressive Action Expert for Vision-Language-Action Models

Yutong Hu; Jan-Nico Zaech; Nikolay Nikolov; Yuanqi Yao; Sombit Dey; Giuliano Albanese; Renaud Detry; Luc Van Gool; Danda Paudel

TL;DR

This work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies and can effectively replace traditional chunk-based action heads for both specialist and generalist policies.

Abstract

We propose a standalone autoregressive (AR) Action Expert that generates actions as a continuous causal sequence while conditioning on refreshable vision-language prefixes. In contrast to existing Vision-Language-Action (VLA) models and diffusion policies that reset temporal context with each new observation and predict actions reactively, our Action Expert maintains its own history through a long-lived memory and is inherently context-aware. This structure addresses the frequency mismatch between fast control and slow reasoning, enabling efficient independent pretraining of kinematic syntax and modular integration with heavy perception backbones, naturally ensuring spatio-temporally consistent action generation across frames. To synchronize these asynchronous hybrid V-L-A modalities, we utilize a re-anchoring mechanism that mathematically accounts for perception staleness during both training and inference. Experiments on simulated and real-robot manipulation tasks demonstrate that the proposed method can effectively replace traditional chunk-based action heads for both specialist and generalist policies. AR-VLA exhibits superior history awareness and substantially smoother action trajectories while maintaining or exceeding the task success rates of state-of-the-art reactive VLAs. Overall, our work introduces a scalable, context-aware action generation schema that provides a robust structural foundation for training effective robotic policies.

Paper Structure (22 sections, 7 equations, 15 figures, 8 tables)

Introduction
Related Work
Vision-Language-Action models (VLAs).
Action Representation and Pretraining.
Architectures with context awareness
Methodology
Problem Formulation
Model Structure
Training Details
Inference Details
Experiments
Generalist and Specialist Policy Performance
Efficiency and Smoothness Analysis
History-Awareness Evaluation
Ablations on Design Decisions
...and 7 more sections

Figures (15)

Figure 2: Performance Overview. (a) Quantitative Results: In both generalist (left) and specialist (right) benchmarks, AR-VLA achieves competitive or superior performance compared to state-of-the-art policies, including OpenVLA, Flow-Matching (FM), ACT, and Diffusion Policy (DP); details are in the section "Generalist and Specialist Policy Performance". (b) Trajectory Quality: Qualitative visualization of joint trajectories over time reveals that AR-VLA produces significantly smoother and more kinematically consistent motion than reactive baselines that reset context at each step (analysis in the section "Efficiency and Smoothness Analysis"). (c) Long-Horizon Capability: AR-VLA successfully completes long-horizon tasks where baselines like DP and FM fail due to a lack of temporal context awareness. The detailed task definition and explanation are in the section "History-Awareness Evaluation".
Figure 3: The AR-VLA Framework. The system asynchronously bridges a VLM backbone with an autoregressive Action Expert. Atemporal features from the VLM are explicitly injected with temporal context via Dynamic Temporal Re-anchoring (DTR). Within the Hybrid KV Cache, re-anchored VL tokens (green) serve as a semantic prefix to the rolling kinematic history (orange). The Action Expert generates future action sequences by querying this shared cache using incrementally advancing step embeddings.
Figure 4: Heterogeneous FIFO Update Rules for the Hybrid KV Cache. The framework manages memory through two distinct queueing strategies to ensure efficient context utilization. The VL Stream (green) operates as a short-lived, block-wise FIFO. In contrast, the Action Stream (orange) maintains a token-wise rolling FIFO, continuously appending the single latest action prediction while evicting the oldest kinematic state.
Figure 5: Simulation benchmark setups. The simulation evaluation spans generalist and specialist policies, with diverse embodiments, action spaces, and tasks.
Figure 6: Zero-shot performance comparison, from BridgeV2 pretraining to a real-world WidowX robot. As is typical of VLA models, the released weights work out of the box without requiring an accurate camera pose. We set the camera pose so that all methods reach a 100% success rate on an easy in-distribution task, then test them zero-shot on challenging tasks. Details of the experimental protocol are in the Appendix.
...and 10 more figures
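
The two queueing rules named in the Figure 4 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation. The class name `HybridCacheSketch`, its parameters, and the string labels standing in for key/value tensors are all hypothetical. In particular, the caption does not state how the VL Stream evicts, so replacing a whole block when a new VLM output arrives is an assumption made here for illustration.

```python
from collections import deque


class HybridCacheSketch:
    """Toy stand-in for a hybrid cache; stores labels, not key/value tensors.

    Hypothetical illustration only: names and eviction sizes are assumptions.
    """

    def __init__(self, max_vl_blocks=1, max_action_tokens=4):
        # VL stream: each entry is one whole block of tokens from a VLM refresh.
        self.vl_blocks = deque(maxlen=max_vl_blocks)
        # Action stream: each entry is a single action token.
        self.action_tokens = deque(maxlen=max_action_tokens)

    def refresh_vl(self, block):
        # Block-wise FIFO (assumed rule): the oldest block is evicted as a unit.
        self.vl_blocks.append(tuple(block))

    def append_action(self, token):
        # Token-wise rolling FIFO: append the latest action, evict the oldest.
        self.action_tokens.append(token)

    def context(self):
        # VL tokens act as a prefix, followed by the rolling action history.
        prefix = [tok for block in self.vl_blocks for tok in block]
        return prefix + list(self.action_tokens)


cache = HybridCacheSketch(max_vl_blocks=1, max_action_tokens=3)
cache.refresh_vl(["vl0_a", "vl0_b"])
for step in range(4):
    cache.append_action(f"a{step}")   # fast control loop
cache.refresh_vl(["vl1_a", "vl1_b"])  # slower perception refresh
print(cache.context())
```

In this toy run the action history survives the VL refresh while the older VL block is dropped entirely, which is the asymmetry the caption describes between the two streams.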
