Learning from Mistakes: Post-Training for Driving VLA with Takeover Data

Yinfeng Gao, Deqing Liu, Qichao Zhang, Yupeng Zheng, Haochen Tian, Guang Li, Hangjun Ye, Long Chen, Da-Wei Ding, Dongbin Zhao

Abstract

Current Vision-Language-Action (VLA) paradigms in end-to-end autonomous driving rely on offline training from static datasets, leaving them vulnerable to distribution shift. Recent post-training methods use takeover data to mitigate this by augmenting the dataset with high-quality expert takeover samples, yet they suffer from two key limitations: supervision restricted to the period after the takeover moment leads to policies with limited safety margins, and passive preference optimization lacks the active exploration needed for optimal performance. In this paper, we propose TakeVLA, a novel VLA post-training framework that overcomes these shortcomings through two complementary innovations. First, we introduce pre-takeover language supervision, which allows the VLA to learn from mistakes proactively. By explicitly teaching the model what to do in error-prone situations, we cultivate a precautionary mindset that anticipates hazards early and substantially enlarges safety margins. Second, we propose Scenario Dreaming, a reinforcement fine-tuning paradigm that operates in reconstructed takeover scenarios, encouraging active exploration beyond mere preference fitting. Experiments on the Bench2Drive benchmark demonstrate that TakeVLA achieves state-of-the-art closed-loop performance, surpassing the strong VLA baseline SimLingo by 4.93 in driving score, with an enhanced safety margin as evidenced by an 11.76% increase in average time-to-collision (TTC).

Paper Structure (17 sections, 4 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: We propose TakeVLA, a post-training framework for driving VLA with takeover data. (a) TakeVLA iteratively collects takeover data via online expert interventions and conducts post-training on the collected data. It introduces pre-takeover supervision for larger safety margins and Scenario Dreaming for active exploration toward an optimal policy. (b) t-SNE visualization shows that the takeover dataset (red) provides valuable out-of-distribution (OOD) samples compared to the pretrain dataset (blue). (c) Closed-loop evaluation on Bench2Drive demonstrates TakeVLA's superior performance, with a higher driving score, success rate, and significantly improved safety margins.
  • Figure 2: Post-training pipeline for TakeVLA. Starting from a pre-trained VLA model, each round consists of three key steps: (1) online interaction with expert takeovers to construct pre-takeover and takeover datasets, (2) supervised fine-tuning on the constructed datasets, and (3) reinforcement fine-tuning via Scenario Dreaming in reconstructed takeover scenarios. Multiple rounds progressively enhance language-conditioned driving performance.
  • Figure 3: Architecture of the baseline VLA model.
  • Figure 4: Language label enhancement. For each takeover trigger type (Follow, Collision, Restart), original language labels (middle) are refined (bottom) based on the specific cause. Enhanced labels emphasize urgency and explicit causality, leading to more conservative and effective driving actions in critical scenarios.
  • Figure 5: Impact of sampling ratio during SFT. The colored bar visualizes the sampling ratio between pre-training and takeover-related buckets.
  • ...and 2 more figures