DAM: Dynamic Adapter Merging for Continual Video QA Learning

Feng Cheng, Ziyang Wang, Yi-Lin Sung, Yan-Bo Lin, Mohit Bansal, Gedas Bertasius

TL;DR

DAM tackles continual video question answering (VidQA) under domain shift. It freezes a large pretrained video-language backbone and trains one adapter per dataset domain. At inference, a non-parametric router scores how relevant each adapter is to the input, and Dynamic Adapter Merging (DAM) combines the adapter weights into a single adapter tailored to that sample. This design mitigates forgetting, supports efficient adaptation to streaming datasets, and enables knowledge sharing across similar domains. On 6 VidQA datasets, DAM improves accuracy by 9.1% and shows 1.9% less forgetting than state-of-the-art domain-incremental learning (DIL) baselines. DAM also extends to continual image classification and image QA, where it outperforms prior methods by a large margin. The approach is parameter-efficient, and the code is publicly available.

Abstract

We present a parameter-efficient method for continual video question-answering (VidQA) learning. Our method, named DAM, uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, (iii) handle inputs from unknown datasets during inference, and (iv) enable knowledge sharing across similar dataset domains. Given a set of continually streaming VidQA datasets, we sequentially train dataset-specific adapters for each dataset while freezing the parameters of a large pretrained video-language backbone. During inference, given a video-question sample from an unknown domain, our method first uses the proposed non-parametric router function to compute a probability for each adapter, reflecting how relevant that adapter is to the current video-question input instance. Subsequently, the proposed dynamic adapter merging scheme aggregates all the adapter weights into a new adapter instance tailored for that particular test sample to compute the final VidQA prediction, mitigating the impact of inaccurate router predictions and facilitating knowledge sharing across domains. Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains. We further extend DAM to continual image classification and image QA and outperform prior methods by a large margin. The code is publicly available at: https://github.com/klauscc/DAM
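
The inference procedure described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how the router computes its probabilities, so the version below makes several assumptions: each dataset is summarized by a centroid feature vector, relevance is a softmax over cosine similarities with a `temperature` parameter, and each adapter is a flat dictionary of parameter lists. The function names `router_probabilities` and `merge_adapters` are hypothetical and are not taken from the released code.

```python
import math


def router_probabilities(query, centroids, temperature=0.1):
    """Return one probability per adapter for a single input feature vector.

    Assumption: the router is a softmax over cosine similarities between the
    query feature and one stored centroid per dataset. It has no trainable
    parameters, which is one way to realize a non-parametric router.
    """
    eps = 1e-12  # guards against division by zero for all-zero vectors

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v + eps)

    logits = [cosine(query, c) / temperature for c in centroids]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def merge_adapters(adapters, probs):
    """Merge adapters in parameter space as a probability-weighted sum.

    adapters: list of dicts mapping a parameter name to a list of floats;
              all dicts are assumed to share the same names and sizes.
    probs:    one weight per adapter, as returned by the router.
    """
    merged = {}
    for name, values in adapters[0].items():
        merged[name] = [
            sum(p * adapter[name][i] for p, adapter in zip(probs, adapters))
            for i in range(len(values))
        ]
    return merged


# Toy usage with two hypothetical dataset domains.
centroids = [[1.0, 0.0], [0.0, 1.0]]
adapters = [{"down": [1.0, 1.0]}, {"down": [3.0, 5.0]}]

query = [0.9, 0.1]  # feature of one test sample, closer to the first domain
probs = router_probabilities(query, centroids, temperature=1.0)
merged = merge_adapters(adapters, probs)
```

Because the merged adapter is a soft combination rather than a hard selection, a router that puts too little probability on the best adapter still lets that adapter contribute to the prediction, which is the robustness property the abstract describes.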

Paper Structure (19 sections, 3 equations, 3 figures, 12 tables)

Figures (3)

  • Figure 1: A high-level overview of our proposed Domain-Incremental Learning (DIL) framework for Video Question-Answering (VidQA). Our model is continually trained on sequentially arriving datasets and evaluated on test samples with unknown dataset identities. Our framework (i) incorporates dataset-specific modules to allow specialization and mitigate forgetting, (ii) enables efficient adaptation to continually streaming datasets, (iii) ensures robustness to incorrect module selections, and (iv) facilitates knowledge-sharing across similar datasets.
  • Figure 2: An overview of our Dynamic Adapter Merging (DAM) framework. (a) Our model is continually trained on sequentially arriving datasets $\{\mathbb{D}_1, \ldots, \mathbb{D}_T\}$. During training on dataset $\mathbb{D}_t$, we only train the adapter $A_t = \{A_t^{(\ell)}\}_{\ell=1}^{L}$ while keeping previously learned adapters fixed. (b) During inference, given a test sample (a video and a text question), we use the proposed router to predict the probability of each adapter being relevant to that particular input instance. Afterward, we dynamically merge multiple dataset-specific adapters in parameter space to reduce the impact of incorrect router predictions and leverage cross-domain VidQA cues. Finally, the pretrained backbone, together with the merged adapter, is used to make the final VidQA predictions.
  • Figure 3: We study the normalized performance gain of dynamic adapter merging as a function of router accuracy. Our results show that dynamic adapter merging leads to a larger boost when the router is inaccurate.
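
The training scheme in Figure 2(a) can be sketched with a deliberately tiny toy model. Everything here is an illustrative assumption rather than the paper's architecture: the "backbone" is a single frozen scalar weight, each "adapter" is a single scalar offset, and training is plain SGD on a squared error. The function name `train_adapters_sequentially` is hypothetical. The only point is the control flow: one new adapter per arriving dataset, with the backbone and all earlier adapters left untouched.

```python
def train_adapters_sequentially(datasets, backbone_w=1.0, lr=0.1, epochs=200):
    """Train one scalar adapter per dataset, in arrival order.

    Toy model (an assumption for illustration): y_hat = (backbone_w + adapter) * x.
    backbone_w is frozen; only the adapter of the current dataset is updated.
    datasets: list of datasets, each a list of (x, y) pairs.
    """
    adapters = []
    for data in datasets:
        adapter = 0.0  # fresh adapter for the newly arrived dataset
        for _ in range(epochs):
            for x, y in data:
                error = (backbone_w + adapter) * x - y
                # Gradient of 0.5 * error**2 with respect to the adapter only.
                adapter -= lr * error * x
        # Store the trained adapter; it is never modified again.
        adapters.append(adapter)
    return adapters


# Two hypothetical domains with different input-output relations.
domain_1 = [(1.0, 2.0), (0.5, 1.0)]   # y = 2.0 * x
domain_2 = [(1.0, 0.5), (0.5, 0.25)]  # y = 0.5 * x

adapters = train_adapters_sequentially([domain_1, domain_2])
```

Since training on the second domain never touches the first adapter, performance on the first domain cannot degrade as long as the correct adapter (or a merge dominated by it) is used at inference.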