DynVLA: Learning World Dynamics for Action Reasoning in Autonomous Driving

Shuyao Shang, Bing Zhan, Yunfei Yan, Yuqi Wang, Yingyan Li, Yasong An, Xiaoman Wang, Jierui Liu, Lu Hou, Lue Fan, Zhaoxiang Zhang, Tieniu Tan

TL;DR

DynVLA is a driving VLA model that introduces Dynamics CoT, a new CoT paradigm that captures the evolution of the world in a compact, interpretable, and efficient form. It consistently outperforms Textual CoT and Visual CoT methods.

Abstract

We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future evolution into a small set of dynamics tokens. Considering the rich environment dynamics in interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world dynamics modeling. We then train DynVLA to generate dynamics tokens before actions through SFT and RFT, improving decision quality while maintaining latency-efficient inference. Compared to Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which introduces substantial redundancy due to dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset demonstrate that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
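
The ordering described above, dynamics tokens first and action tokens second, can be illustrated with a minimal sketch of how one supervised target sequence might be laid out. This is an illustration under stated assumptions, not the authors' implementation: the marker strings (`<dyn>`, `<act>`), the token naming scheme, and the function `build_dynamics_cot_target` are all hypothetical, and placing ego-centric tokens before environment-centric ones is an assumption as well.

```python
from typing import List

# Hypothetical marker strings; the model's real vocabulary is not given in the source.
BOD, EOD = "<dyn>", "</dyn>"   # begin / end of the dynamics segment
BOA, EOA = "<act>", "</act>"   # begin / end of the action segment


def build_dynamics_cot_target(
    ego_ids: List[int], env_ids: List[int], action_ids: List[int]
) -> List[str]:
    """Lay out one target sequence: dynamics tokens first, action tokens after."""
    # Ego-before-environment ordering is an assumption made for this sketch.
    dynamics = [f"<ego_{i}>" for i in ego_ids] + [f"<env_{i}>" for i in env_ids]
    actions = [f"<a_{i}>" for i in action_ids]
    return [BOD, *dynamics, EOD, BOA, *actions, EOA]


# Toy indices standing in for codebook and action-vocabulary entries.
target = build_dynamics_cot_target([3, 7], [12, 5], [41, 9, 88])
print(" ".join(target))
```

Because the dynamics segment is only a handful of discrete tokens, an autoregressive decoder trained on sequences like this has to emit its forecast of the scene before committing to an action, without paying for a long text trace or a dense image.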

Paper Structure (53 sections, 11 equations, 10 figures, 8 tables)

Figures (10)

  • Figure 1: Comparison of different CoT paradigms in autonomous driving VLA models. (a) Textual CoT suffers from limited spatiotemporal understanding and high inference latency due to long textual reasoning traces. (b) Visual CoT introduces substantial redundancy and computational overhead from pixel-level generation. (c) Dynamics CoT compresses future dynamics into a small set of tokens, achieving latency-efficient inference with compact reasoning and accurate spatiotemporal modeling.
  • Figure 2: Overview of DynVLA. (a) Given adjacent image observations, a dynamics encoder extracts ego-centric and environment-centric dynamics, which are discretized via VQ codebooks. The ego-centric dynamics are then regularized by the GT ego action, and the combined dynamics are decoded to reconstruct the future image and BEV map, each conditioned on its current state. (b) DynVLA is supervised to first generate discrete dynamics tokens and then action tokens, forming structured Dynamics CoT modeling.
  • Figure 3: Overview of the training pipeline for DynVLA. DynVLA first learns a Dynamics Tokenizer by reconstructing future states from adjacent frames, producing discrete dynamics tokens. It then performs SFT on Dynamics CoT, training the model to generate dynamics tokens before action tokens. Finally, the policy is optimized via RFT with trajectory-level reward and KL regularization.
  • Figure 4: Transferability of learned dynamics. Dynamics tokens extracted from one scenario are injected into a new scene and decoded into the future image and the BEV map. We contrast the current states, the future states decoded with transferred dynamics, and the original future states. The results show that both ego-centric and environment-centric dynamics are transferable across scenarios.
  • Figure 5: Codebook collapse without dynamics decoupling.
  • ...and 5 more figures
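
The discretization step mentioned in the Figure 2 caption (dynamics "discretized via VQ codebooks") can be sketched as a nearest-neighbor lookup, with one codebook for ego-centric dynamics and a separate one for environment-centric dynamics. This is a generic vector-quantization sketch, not the paper's tokenizer: the codebook sizes, the 2-D latents, and the function names are assumptions chosen for illustration, and the encoder, the GT ego action regularization, and the decoder are omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest_code(z: Vector, codebook: Sequence[Vector]) -> int:
    """Return the index of the codebook entry closest to z (squared L2 distance)."""
    def sq_dist(a: Vector, b: Vector) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))


def quantize(latents: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each continuous latent to a discrete token id."""
    return [nearest_code(z, codebook) for z in latents]


# Toy codebooks (assumed sizes); keeping them separate mirrors the decoupling
# of ego-centric and environment-centric dynamics.
ego_codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
env_codebook = [(-1.0, -1.0), (1.0, 1.0)]

# Toy latents standing in for dynamics-encoder outputs.
ego_latents = [(0.9, 0.1), (0.1, 0.8)]
env_latents = [(0.7, 0.6), (-0.4, -0.9)]

ego_tokens = quantize(ego_latents, ego_codebook)
env_tokens = quantize(env_latents, env_codebook)
print(ego_tokens, env_tokens)  # [1, 2] [1, 0]
```

With separate codebooks, each set of entries only has to cover one source of change. This is one plausible reading of why the captions associate a single shared representation with codebook collapse.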