VLM-Assisted Continual learning for Visual Question Answering in Self-Driving

Yuxin Lin, Mengshi Qi, Liang Liu, Huadong Ma

TL;DR

The paper tackles Visual Question Answering in autonomous driving by integrating Vision-Language Models with continual learning to combat catastrophic forgetting across four tasks: perception, prediction, planning, and behavior. It introduces a hybrid framework that combines memory replay with selective knowledge distillation (KD) and per-task embedding projection to preserve past knowledge while learning new driving tasks. Memory samples are curated via TF-IDF and K-means so that the replay data stays diverse and representative, while task-specific projection layers constrain feature drift across tasks. Empirical results on the DriveLM dataset show significant improvements over baselines in standard VQA metrics, and ablations confirm the complementary contributions of memory replay, KD, and projection regularization. The work advances resilient, multimodal reasoning for self-driving systems and provides practical guidance for deploying continual learning in safety-critical autonomous platforms.

Abstract

In this paper, we propose a novel approach for solving the Visual Question Answering (VQA) task in autonomous driving by integrating Vision-Language Models (VLMs) with continual learning. In autonomous driving, VQA plays a vital role in enabling the system to understand and reason about its surroundings. However, traditional models often struggle with catastrophic forgetting when sequentially exposed to new driving tasks, such as perception, prediction, and planning, each requiring different forms of knowledge. To address this challenge, we present a novel continual learning framework that combines VLMs with selective memory replay and knowledge distillation, reinforced by task-specific projection layer regularization. Knowledge distillation allows a previously trained model to act as a "teacher" that guides the model through subsequent tasks, minimizing forgetting. Meanwhile, task-specific projection layers compute a loss based on the divergence of feature representations, ensuring continuity in learning and reducing the shift between tasks. Evaluated on the DriveLM dataset, our framework shows substantial performance improvements, with gains ranging from 20.11% to 35.16% across various metrics. These results highlight the effectiveness of combining continual learning with VLMs in enhancing the resilience and reliability of VQA systems in autonomous driving. We will release our source code.
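
The distillation and feature-continuity ideas in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the temperature, the weights `alpha` and `beta`, and the mean-squared drift penalty are all hypothetical choices, and plain Python lists stand in for model logits and projected features.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2


def cross_entropy(logits, target_index):
    """Standard cross-entropy against a single ground-truth index."""
    return -math.log(softmax(logits)[target_index])


def feature_drift(old_features, new_features):
    """Assumed drift penalty: mean squared difference between the features
    produced for the same sample before and after learning the new task."""
    diffs = [(a - b) ** 2 for a, b in zip(old_features, new_features)]
    return sum(diffs) / len(diffs)


def replay_step_loss(student_logits, teacher_logits, target_index,
                     old_features, new_features, alpha=0.5, beta=0.1):
    """Hypothetical combined objective for one replayed memory sample."""
    task_term = cross_entropy(student_logits, target_index)
    distill_term = kd_loss(teacher_logits, student_logits)
    drift_term = feature_drift(old_features, new_features)
    return task_term + alpha * distill_term + beta * drift_term


if __name__ == "__main__":
    loss = replay_step_loss(
        student_logits=[0.2, 1.5, -0.3],
        teacher_logits=[0.1, 2.0, -0.5],
        target_index=1,
        old_features=[0.4, 0.1, 0.9],
        new_features=[0.5, 0.0, 0.8],
    )
    print(f"combined loss: {loss:.4f}")
```

The distillation term is zero when the student reproduces the teacher's softened distribution and grows as the two diverge, which is the sense in which the previously trained model "guides" the current one. The drift term plays the analogous role for feature representations.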

Paper Structure

This paper contains 28 sections, 16 equations, 13 figures, 8 tables, 2 algorithms.

Figures (13)

  • Figure 1: This illustration presents a framework for a Vision-Language Model (VLM) designed for multi-task autonomous driving. (a) The upper section shows four essential tasks: Perception, Prediction, Planning, and Behavior, each represented with a specific question-answer example to demonstrate the model's response to various driving scenarios. (b) The lower section outlines the model's simplified pipeline, incorporating memory replay with knowledge distillation and projection layer regularization to enhance continual learning capabilities across tasks. (c) The diagram in the bottom-right corner illustrates how continual learning methods enable the model to add new knowledge and functionality without overwriting or erasing prior knowledge.
  • Figure 2: (a) Memory replay with knowledge distillation: when memory data comes in, the model applies memory replay together with knowledge distillation, and data selection is optimized using TF-IDF and K-means clustering to maintain a diverse and representative memory set. (b) Task-specific embedding projection regularization: task-specific projection layers within an autonomous driving model maintain feature continuity and mitigate catastrophic forgetting by transforming multimodal embeddings into unique task-specific spaces and regularizing the model with a specialized loss function.
  • Figure 3: Examples of correct and failed answer generations from our model.
  • Figure 4: Detailed visual analysis of how different configuration settings affect the performance metrics of our continual learning model for autonomous driving. Each subfigure presents one aspect of model tuning and its direct correlation with performance on language and vision tasks.
  • Figure 5: Bubble chart visualization of the perception task clusters generated using the TF-IDF and K-means pipeline.
  • ...and 8 more figures
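
The memory-selection pipeline named in the Figure 2(a) and Figure 5 captions (TF-IDF followed by K-means) can be sketched as follows. This is a rough standard-library illustration under assumptions the captions do not confirm: TF-IDF is applied to the question text, one sample per cluster (the one nearest the centroid) is kept for replay, and all function names and the example questions are hypothetical.

```python
import math
import random
from collections import Counter


def tfidf_vectors(questions):
    """L2-normalised TF-IDF vectors using whitespace tokens and smoothed IDF."""
    docs = [q.lower().split() for q in questions]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vec = [tf[w] / (len(d) or 1) * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vectors


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iters=20, seed=0):
    """Plain Lloyd's algorithm; initial centroids are k random data points."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def select_memory(questions, k):
    """Return indices of one representative question per non-empty cluster."""
    vectors = tfidf_vectors(questions)
    labels, centroids = kmeans(vectors, k)
    chosen = []
    for c in range(k):
        idx = [i for i, l in enumerate(labels) if l == c]
        if idx:
            chosen.append(min(idx, key=lambda i: sq_dist(vectors[i], centroids[c])))
    return sorted(chosen)


if __name__ == "__main__":
    questions = [
        "what objects are in front of the ego vehicle",
        "which objects are behind the ego vehicle",
        "what is the status of the traffic light ahead",
        "is the traffic light red or green",
        "should the ego vehicle slow down now",
        "should the ego vehicle change lanes",
    ]
    for i in select_memory(questions, k=3):
        print(i, questions[i])
```

Keeping the sample nearest each centroid is one simple way to make the replay buffer cover every cluster. A larger per-cluster budget, or sampling in proportion to cluster size, would be equally plausible readings of "diverse and representative".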