
Audio-Visual Continual Test-Time Adaptation without Forgetting

Sarthak Kumar Maharana, Akshay Mehra, Bhavya Ramakrishna, Yunhui Guo, Guan-Ming Su

TL;DR

This paper tackles audio-visual continual test-time adaptation (AV-CTTA) under unlabeled, non-stationary test distributions, where existing methods suffer from catastrophic forgetting. It shows that adapting only the fusion-layer attention parameters $(\mathcal{W}_Q, \mathcal{W}_K, \mathcal{W}_V)$ yields strong intra- and cross-domain transfer, motivating a source-free approach that reuses prior adaptation through a shared buffer. The proposed AV-CTTA method stores modality-specific input statistics and fusion parameters as buffer elements and retrieves the most relevant state via a distance metric on current statistics, with EMA updates to ensure smooth transfer. Extensive experiments on unimodal and bimodal corruptions demonstrate state-of-the-art performance and substantial mitigation of forgetting, with robust behavior across buffer budgets and task orders. The work offers a practical, memory-efficient pathway to deploy audio-visual models in continually changing environments without access to source data.

Abstract

Audio-visual continual test-time adaptation involves continually adapting a source audio-visual model at test-time to unlabeled, non-stationary domains in which either or both modalities can be distributionally shifted. Such shifts hamper online cross-modal learning and eventually lead to poor accuracy. While previous works have tackled this problem, we find that SOTA methods suffer from catastrophic forgetting, where the model's performance drops well below that of the source model due to continual parameter updates at test-time. In this work, we first show that adapting only the modality fusion layer to a target domain not only improves performance on that domain but can also enhance performance on subsequent domains. Based on this strong cross-task transferability of the fusion layer's parameters, we propose a method, $\texttt{AV-CTTA}$, that improves test-time performance of the models without access to any source data. Our approach works by using a selective parameter retrieval mechanism that dynamically retrieves the best fusion layer parameters from a buffer using only a small batch of test data. These parameters are then integrated into the model, adapted to the current test distribution, and saved back for future use. Extensive experiments on benchmark datasets involving unimodal and bimodal corruptions show that our proposed $\texttt{AV-CTTA}$ significantly outperforms existing methods while minimizing catastrophic forgetting.
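
The retrieve, adapt, and save-back cycle described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `BufferElement`, `retrieve`, `step`, and `alpha` are hypothetical, statistics and fusion parameters are flattened into lists of floats, the distance function and the adaptation step are passed in as callables, and applying the EMA when saving back is one plausible reading of "EMA updates".

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

Vector = List[float]


@dataclass
class BufferElement:
    stats: Vector   # flattened input statistics of the batches seen by this element
    params: Vector  # flattened fusion parameters (stand-in for W_Q, W_K, W_V)


def ema(old: Sequence[float], new: Sequence[float], alpha: float) -> Vector:
    """Exponential moving average: alpha * old + (1 - alpha) * new."""
    return [alpha * o + (1.0 - alpha) * n for o, n in zip(old, new)]


def retrieve(buffer: List[BufferElement], stats: Vector,
             distance: Callable[[Vector, Vector], float],
             tau: float) -> Optional[int]:
    """Index of the closest element if it lies within tau, else None."""
    if not buffer:
        return None
    dists = [distance(stats, element.stats) for element in buffer]
    best = min(range(len(buffer)), key=dists.__getitem__)
    return best if dists[best] <= tau else None


def step(buffer: List[BufferElement], stats: Vector, current_params: Vector,
         adapt: Callable[[Vector], Vector],
         distance: Callable[[Vector, Vector], float],
         tau: float, alpha: float = 0.9) -> Vector:
    """One test-time step: retrieve (or fall back), adapt, save back."""
    idx = retrieve(buffer, stats, distance, tau)
    # Hit: start from the stored parameters. Miss: keep the current ones.
    start = buffer[idx].params if idx is not None else current_params
    adapted = adapt(start)
    if idx is None:
        # Assumption: a miss creates a new element after adaptation.
        buffer.append(BufferElement(list(stats), list(adapted)))
    else:
        # Assumption: a hit is smoothed into the stored element via EMA.
        element = buffer[idx]
        element.stats = ema(element.stats, stats, alpha)
        element.params = ema(element.params, adapted, alpha)
    return adapted
```

Passing `distance` and `adapt` as callables keeps the sketch independent of any particular divergence or loss; in practice `adapt` would be a gradient step on the fusion layer using the current unlabeled batch.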

Paper Structure (26 sections, 13 equations, 10 figures, 7 tables, 1 algorithm)


Figures (10)

  • Figure 1: We illustrate audio-visual continual test-time adaptation using an example of a deployed agent with audio-visual sensors for scene understanding. Starting from a source model parameterized by $\theta^S$, the agent encounters a sequence of evolving target environments where distributional shifts may affect the audio modality, the visual modality, or both, motivating continual adaptation at test-time. The goal is to maintain robust performance without access to the source data and task boundaries.
  • Figure 2: $\texttt{AV-CTTA}$ achieves SOTA performance on audio-visual CTTA. We report task-wise accuracy on VGGSound-2C [maharana2025texttt] at severity level 5 under correlated bimodal corruptions in the continual setting. CAV-MAE [gong2022contrastive] is used as the SOURCE model. We extend TTA (TENT, EATA, SAR) and AV-TTA (READ, PTA, SuMi, BriMPR*) methods to the continual setting. Existing AV-TTA methods struggle under severe correlated bimodal corruptions.
  • Figure 3: An attention fusion layer adapted on a single domain transfers successfully to other domains, achieving performance exceeding or comparable to that of the source model. This motivates us to store parameter snapshots in a buffer that can be reused during audio-visual CTTA. We adapt the projection matrices $\{\mathcal{W}_Q, \mathcal{W}_K, \mathcal{W}_V\}$ of the joint encoder $f_j$ of a pre-trained CAV-MAE (SOURCE) [gong2022contrastive] on the first unseen domain of each corruption category, following READ [yang2024test]. This adapted state is then frozen and evaluated on the remaining sequence of unseen domains. We report the accuracy change $\Delta$ (in %) over SOURCE.
  • Figure 4: Illustration of $\texttt{AV-CTTA}$. At time-step $t$, the audio and visual inputs are modeled as Gaussian distributions with statistics $\mu_a^t, \Sigma_a^t, \mu_v^t, \Sigma_v^t$. The selective retrieval stage uses the KL divergence $g(\cdot)$ to compare the current statistics against all elements in the shared buffer $\mathcal{K}$ ($\mathcal{M}$ is the current set of indices). If the best match lies within the threshold $\tau$, its stored parameters ($\mathcal{W}_Q, \mathcal{W}_K, \mathcal{W}_V$) are retrieved for adaptation at time-step $t$. Otherwise, the buffer expansion stage adds the current statistics and parameters to $\mathcal{K}$. Redundant elements are merged to maintain a memory budget $\eta$. Continual adaptation proceeds with the current parameters, i.e., those from time-step $t-1$.
  • Figure 5: $\texttt{AV-CTTA}$ achieves SOTA results on Kinetics50-2C (left) and VGGSound-2C (right). We report mean accuracy (%) at a severity level of 5 in the continual setting. Here, buffer size $\eta=\infty$.
  • ...and 5 more figures
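
The Figure 4 caption names two ingredients that are easy to make concrete: a KL divergence between Gaussian input statistics, and merging of redundant buffer elements under a budget $\eta$. The sketch below is illustrative only. It assumes diagonal covariances (the closed-form KL for that case is standard), and the merge rule, which averages the closest pair under a symmetrized KL, is an assumption made for illustration since the caption does not specify how redundant elements are merged. The function names are hypothetical, and buffer elements here hold only statistics, not parameters.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple

# (mean, variance) per feature dimension; diagonal covariance is an assumption.
Gaussian = Tuple[List[float], List[float]]


def kl_diag_gaussians(mu0: Sequence[float], var0: Sequence[float],
                      mu1: Sequence[float], var1: Sequence[float]) -> float:
    """Closed-form KL(N(mu0, var0) || N(mu1, var1)) for diagonal Gaussians."""
    total = 0.0
    for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1):
        total += v0 / v1 + (m1 - m0) ** 2 / v1 - 1.0 + math.log(v1 / v0)
    return 0.5 * total


def symmetric_kl(a: Gaussian, b: Gaussian) -> float:
    """KL(a || b) + KL(b || a), so the pairwise score ignores argument order."""
    return kl_diag_gaussians(*a, *b) + kl_diag_gaussians(*b, *a)


def merge_closest_pair(buffer: List[Gaussian]) -> None:
    """Replace the two most similar elements with their average, in place."""
    i, j = min(combinations(range(len(buffer)), 2),
               key=lambda pair: symmetric_kl(buffer[pair[0]], buffer[pair[1]]))
    (mu_i, var_i), (mu_j, var_j) = buffer[i], buffer[j]
    # Hypothetical merge rule: element-wise mean of the two sets of statistics.
    buffer[i] = ([(a + b) / 2 for a, b in zip(mu_i, mu_j)],
                 [(a + b) / 2 for a, b in zip(var_i, var_j)])
    del buffer[j]  # j > i, so index i is unaffected


def enforce_budget(buffer: List[Gaussian], eta: int) -> None:
    """Merge closest pairs until the buffer holds at most eta elements."""
    while len(buffer) > eta and len(buffer) >= 2:
        merge_closest_pair(buffer)
```

With $\eta=\infty$, as in Figure 5, `enforce_budget` would never merge anything; a finite budget trades memory for some loss of resolution between similar domains.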