Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation

Malvina Nikandrou, Georgios Pantazopoulos, Ioannis Konstas, Alessandro Suglia

TL;DR

Each modality evolves at a different rate across a continuum of tasks, and this behavior occurs in established encoder-only models as well as in modern recipes for developing Vision & Language models.

Abstract

Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that each modality evolves at different rates across a continuum of tasks and that this behavior occurs in established encoder-only models as well as modern recipes for developing Vision & Language (VL) models. Motivated by this observation, we propose a modality-aware feature distillation (MAFED) approach which outperforms existing baselines across models of varying scale in three multimodal continual learning settings. Furthermore, we provide ablations showcasing that modality-aware distillation complements experience replay. Overall, our results emphasize the importance of addressing modality-specific dynamics to prevent forgetting in multimodal continual learning.

Paper Structure

This paper contains 31 sections, 7 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Overview of MAFED. Along with training on the data from the current task and a memory buffer, we apply feature distillation using the previous checkpoint as the teacher. The distillation losses applied to the representations from question and visual tokens are weighted separately to compensate for modality-specific training dynamics.
  • Figure 2: Illustration of tasks in each of the three continual learning settings for VQA. Each of these settings consists of five tasks. The first two settings are defined based on the visual categories. In Diverse Content, the objects present in each task are grouped randomly, while in Taxonomy Content, the objects are grouped based on their supercategory. Finally, in Question Types, the tasks are defined according to the type of the questions.
  • Figure 3: Ratio of text-to-image representation similarity across layers and tasks, for UNITER (first row), ViLT (second row), and VL-Pythia (third row). We consistently observe that in the earlier layers, the ratio is close to one, indicating that representations from both modalities change at a similar rate. However, in intermediate or deeper layers, text representations seem to retain larger similarities.
  • Figure 4: Ablation of feature distillation from a single or cumulative (+) model layers.
  • Figure 5: Language weight during MAFED-A. Note that language tokens in the encoder family (UNITER and ViLT) are weighted similarly across the layers of the models. For the causal VL-Pythia model, the language tokens have higher weights.
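
The Figure 1 caption describes distillation losses on question and visual tokens that are weighted separately. A minimal sketch of that idea follows; it is not the authors' implementation. The function name `modality_weighted_distillation`, the mean-squared-error distance, and the single `text_weight` parameter with convex weighting are all assumptions made for illustration, since the captions above do not specify the distance function or how the weights are set.

```python
import numpy as np

def modality_weighted_distillation(student, teacher, is_text, text_weight=0.5):
    """Hypothetical modality-aware feature distillation loss.

    student, teacher: arrays of shape (num_tokens, hidden_size) holding the
        current model's and the previous checkpoint's token representations.
    is_text: boolean array of shape (num_tokens,), True for question tokens
        and False for visual tokens.
    text_weight: assumed scalar in [0, 1]; visual tokens get 1 - text_weight.
    """
    student = np.asarray(student, dtype=float)
    teacher = np.asarray(teacher, dtype=float)
    is_text = np.asarray(is_text, dtype=bool)

    # Per-token squared error, averaged over the hidden dimension (assumed distance).
    per_token = ((student - teacher) ** 2).mean(axis=-1)

    # Average within each modality so that neither modality dominates
    # simply because it contributes more tokens.
    text_loss = per_token[is_text].mean() if is_text.any() else 0.0
    image_loss = per_token[~is_text].mean() if (~is_text).any() else 0.0

    # Weight the two modality losses separately.
    return text_weight * text_loss + (1.0 - text_weight) * image_loss
```

In a training loop this term would be added to the task loss computed on the current-task and memory-buffer batches, with the teacher representations taken from the frozen previous checkpoint. Averaging within each modality before weighting is one possible design choice; weighting individual tokens directly is another.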