The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

Runhao Mao, Hanshi Wang, Yixiang Yang, Qianli Ma, Jingmeng Zhou, Zhipeng Zhang

Abstract

The integration of Vision-Language Models (VLMs) into autonomous driving promises to solve long-tail scenarios, but this paradigm faces the critical and largely unaddressed challenge of catastrophic forgetting. The very fine-tuning process used to adapt these models to driving-specific data simultaneously erodes their invaluable pre-trained world knowledge, creating a self-defeating paradox that undermines the core reason for their use. This paper provides the first systematic investigation into this phenomenon. We introduce a new large-scale dataset of 180K scenes, which enables the first benchmark specifically designed to quantify catastrophic forgetting in autonomous driving. Our analysis reveals that existing methods suffer from significant knowledge degradation. To address this, we propose the Drive Expert Adapter (DEA), a novel framework that circumvents this trade-off by shifting adaptation from the weight space to the prompt space. DEA dynamically routes inference through different knowledge experts based on scene-specific cues, enhancing driving-task performance without corrupting the model's foundational parameters. Extensive experiments demonstrate that our approach not only achieves state-of-the-art results on driving tasks but also effectively mitigates catastrophic forgetting, preserving the essential generalization capabilities that make VLMs a transformative force for autonomous systems. Data and models are released at FidelityDrivingBench.

Paper Structure

This paper contains 16 sections, 5 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Illustration of Fidelity Driving Bench. We introduce a benchmark to quantify knowledge forgetting in general VLMs after fine-tuning on driving data, spanning 180K frames and 900K long-tail QA pairs, covering 3 tasks across 15 data sources with 2 forgetting metrics, and revealing 3 forgetting phenomena.
  • Figure 2: Catastrophic forgetting leads to degraded generalization in long-tail scenarios, which may result in safety-critical failures.
  • Figure 3: The proposed dataset construction pipeline. We first integrate fifteen existing annotated datasets and additionally provide language annotations for the WOD-E2E dataset. Each scene is then represented by a set of sparse elements, automatically extracted from the annotations using GPT. Next, we conduct manual verification to ensure accuracy and diversity. Finally, we retain 1,000 representative images as the test set; the corresponding statistical summaries are presented in the bottom-right pane.
  • Figure 4: Noteworthy Objects’ Perception Recall across fine-tuning epochs on our benchmark. Each curve corresponds to a specific backbone and tuning strategy.
  • Figure 5: Illustration of Driving Expert Adapter. Our framework comprises a Prompt Adapter that selects the most suitable learnable prompt tokens for the current scenario, and a Task-Adaptive Expert Module that leverages a gated network over all tokens to activate the LoRA experts most appropriate for the current scene.