One VLM to Keep it Learning: Generation and Balancing for Data-free Continual Visual Question Answering

Deepayan Das, Davide Talon, Massimiliano Mancini, Yiming Wang, Elisa Ricci

TL;DR

This paper introduces GaB, a data-free rehearsal framework for continual visual question answering (VQACL) that leverages the generative language capabilities of Vision-Language Models to synthesize pseudo-rehearsal question-answer pairs on current task images. A key challenge is the skewed distribution of generated questions; GaB addresses this with a balancing module that aligns synthetic data with ground-truth question-type statistics, either via meta-information-based partitioning or unsupervised clustering. The approach trains task-specific QA generators while freezing past heads, builds a balanced rehearsal buffer from generated data, and updates a shared VQA head through sequential learning. Experiments on VQACL-VQAv2 and CLOVE-function show GaB surpasses all data-free baselines and approaches the performance of methods with access to past data, highlighting its practical potential under privacy or storage constraints.

Abstract

Vision-Language Models (VLMs) have shown significant promise in Visual Question Answering (VQA) tasks by leveraging web-scale multimodal datasets. However, these models often struggle with continual learning due to catastrophic forgetting when adapting to new tasks. An effective remedy is rehearsal, which reuses data from past tasks while learning a new task. However, this strategy requires storing past data, which might not be feasible due to hardware constraints or privacy concerns. In this work, we propose the first data-free method that leverages the language generation capability of a VLM, instead of relying on external models, to produce pseudo-rehearsal data for continual VQA. Our proposal, named GaB, generates pseudo-rehearsal data by posing previous-task questions on new-task data. Yet, despite being effective, the distribution of generated questions skews towards the most frequently posed questions due to the limited and task-specific training data. To mitigate this issue, we introduce a pseudo-rehearsal balancing module that aligns the generated data with the ground-truth data distribution using either question meta-statistics or an unsupervised clustering method. We evaluate our method on two recent benchmarks, i.e., VQACL-VQAv2 and CLOVE-function. GaB outperforms all data-free baselines, with substantial improvement in maintaining VQA performance across evolving tasks, while being on par with methods that have access to past data.
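
To make the balancing idea concrete, the following is a minimal sketch of resampling generated question-answer pairs so that their question-type mix follows a target distribution. It is an illustration under stated assumptions, not the authors' implementation: the function name `balance_by_question_type`, the tuple layout `(question_type, question, answer)`, and the per-type quota rule are all hypothetical, and the sketch only covers the case where question types are known (the meta-statistics variant), not the clustering variant.

```python
import random
from collections import defaultdict


def balance_by_question_type(generated, target_dist, buffer_size, seed=0):
    """Sample a rehearsal buffer whose question-type mix follows target_dist.

    generated:   list of (question_type, question, answer) tuples (assumed layout)
    target_dist: dict mapping question_type -> target proportion
    buffer_size: desired number of samples in the buffer
    """
    rng = random.Random(seed)

    # Group the generated samples by their question type.
    by_type = defaultdict(list)
    for sample in generated:
        by_type[sample[0]].append(sample)

    buffer = []
    for qtype, proportion in target_dist.items():
        pool = by_type.get(qtype, [])
        # Quota for this type, capped by how many samples were generated.
        quota = min(round(proportion * buffer_size), len(pool))
        buffer.extend(rng.sample(pool, quota))
    return buffer


# Toy data: generation is skewed 90/10 towards "color" questions.
generated = [("color", f"what color is object {i}?", "red") for i in range(90)]
generated += [("count", f"how many objects in image {i}?", "2") for i in range(10)]

# Hypothetical target statistics: both types equally frequent.
buffer = balance_by_question_type(generated, {"color": 0.5, "count": 0.5}, 20)
```

With these toy inputs the buffer holds ten questions of each type even though the generated pool is heavily skewed. When a type has fewer generated samples than its quota, this sketch simply returns a smaller buffer; other fallbacks, such as oversampling, are equally plausible.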

Paper Structure

This paper contains 30 sections, 5 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Key ideas. Top: we explore the language generation capability of VLMs to synthesize pseudo-rehearsal data of previous tasks to mitigate catastrophic forgetting in continual VQA. Bottom: as pseudo-rehearsal data tends to skew towards particular question types, we further propose a pseudo-rehearsal balancing module to align such skewed distribution towards the ground-truth meta-statistics, effectively improving the task performance while avoiding forgetting.
  • Figure 2: Architecture of the proposed data-free method GaB for addressing VQACL. (a) At task $t$, past task-specific projection heads $f^s_{v \to qa}$, $s = 1, \ldots, t-1$, are used to generate pseudo-rehearsal data with question-answer pairs about old tasks on current task images. (b) Pseudo-rehearsal samples undergo a balancing process through a module designed to ensure that under-represented question types are adequately represented. (c) Finally, the pseudo-rehearsal data is used to mitigate forgetting in the sequential learning scenario, and the data $D_t$ is used to learn the current question-answer projector $f^t_{v \to qa}$.
  • Figure 3: Data-bias present in real vs generated data. Generated questions are heavily skewed towards a certain type of questions. Left: question distribution for task type. Right: question distribution for task location.
  • Figure 4: Analysis of balancing strategies at varying rehearsal buffer sizes (1k, 2.5k, and 5k samples) on the VQACL-VQAv2 benchmark in terms of AP.
  • Figure 5: Qualitative visualization of the generated pseudo-rehearsal data on the VQACL-VQAv2 benchmark. Top row: analysis on generation conditioning for questions balancing. Bottom row: analysis on the pseudo-strategy used for answer generation.
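
The three stages described in the Figure 2 caption can be sketched as a single function that assembles the training set for one task. This is a rough illustration only: `build_training_set`, `toy_generator`, and `truncate` are hypothetical names, the generators are plain Python callables standing in for the past projection heads, and no actual model training is shown.

```python
def build_training_set(images, current_data, past_generators, balance, buffer_size):
    """Assemble the data used at one task step (hypothetical helper).

    images:          identifiers of the current task's images
    current_data:    list of (image_id, question, answer) for the current task
    past_generators: callables image_id -> list of (image_id, question, answer),
                     standing in for the frozen past-task QA generators
    balance:         callable (samples, buffer_size) -> balanced subset
    """
    # (a) Pose old-task questions on current-task images.
    pseudo = []
    for generate in past_generators:
        for image_id in images:
            pseudo.extend(generate(image_id))

    # (b) Balance the pseudo-rehearsal samples into a fixed-size buffer.
    rehearsal = balance(pseudo, buffer_size)

    # (c) Current data plus rehearsal buffer form the training set.
    return list(current_data) + rehearsal


# Toy stand-ins, for illustration only.
def toy_generator(image_id):
    return [(image_id, "what color is it?", "red")]


def truncate(samples, size):
    return samples[:size]


train = build_training_set(
    images=["img1", "img2", "img3"],
    current_data=[("img1", "where is it?", "kitchen")],
    past_generators=[toy_generator],
    balance=truncate,
    buffer_size=2,
)
```

On the first task `past_generators` is empty, so the training set reduces to the current task data. The `truncate` stand-in only marks where a balancing strategy would plug in.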