Audio-Visual Continual Test-Time Adaptation without Forgetting
Sarthak Kumar Maharana, Akshay Mehra, Bhavya Ramakrishna, Yunhui Guo, Guan-Ming Su
TL;DR
This paper tackles audio-visual continual test-time adaptation (AV-CTTA), in which a model must adapt to unlabeled, non-stationary test distributions without catastrophic forgetting. It shows that adapting only the fusion-layer attention parameters $(\mathcal{W}_Q, \mathcal{W}_K, \mathcal{W}_V)$ yields strong intra- and cross-domain transfer, motivating a source-free approach that reuses prior adaptation through a shared buffer. The proposed AV-CTTA method stores modality-specific input statistics and fusion parameters as buffer elements and retrieves the most relevant state via a distance metric on the current statistics, with exponential moving average (EMA) updates to smooth the transfer. Extensive experiments on unimodal and bimodal corruptions demonstrate state-of-the-art performance and substantial mitigation of forgetting, with robust behavior across buffer budgets and task orders. The work offers a practical, memory-efficient pathway to deploying audio-visual models in continually changing environments without access to source data.
Abstract
Audio-visual continual test-time adaptation involves continually adapting a source audio-visual model at test time to unlabeled, non-stationary domains in which either or both modalities can be distributionally shifted. Such shifts hamper online cross-modal learning and eventually lead to poor accuracy. While previous works have tackled this problem, we find that state-of-the-art methods suffer from catastrophic forgetting, where the model's performance drops well below that of the source model due to continual parameter updates at test time. In this work, we first show that adapting only the modality fusion layer to a target domain not only improves performance on that domain but can also enhance performance on subsequent domains. Based on this strong cross-task transferability of the fusion layer's parameters, we propose a method, $\texttt{AV-CTTA}$, that improves the test-time performance of the model without access to any source data. Our approach uses a selective parameter retrieval mechanism that dynamically retrieves the best fusion layer parameters from a buffer using only a small batch of test data. These parameters are then integrated into the model, adapted to the current test distribution, and saved back for future use. Extensive experiments on benchmark datasets involving unimodal and bimodal corruptions show that our proposed $\texttt{AV-CTTA}$ significantly outperforms existing methods while minimizing catastrophic forgetting.
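The retrieve, adapt, and save-back loop described above can be illustrated with a small sketch. This is not the authors' implementation: the class name `FusionBuffer`, the use of per-modality mean vectors as input statistics, the summed Euclidean distance, the oldest-first eviction rule, and the EMA momentum are all assumptions chosen for illustration, and the fusion parameters are stood in for by a flat NumPy vector.

```python
import numpy as np


class FusionBuffer:
    """Toy buffer of (audio_stat, video_stat, fusion_params) entries.

    Hypothetical sketch: entry layout, distance, and eviction are assumptions.
    """

    def __init__(self, budget=4, momentum=0.9):
        self.budget = budget      # maximum number of stored entries
        self.momentum = momentum  # EMA weight kept on the stored value
        self.entries = []         # list of dicts with keys audio/video/params

    @staticmethod
    def _distance(entry, audio_stat, video_stat):
        # Assumed metric: sum of per-modality Euclidean distances.
        return (np.linalg.norm(entry["audio"] - audio_stat)
                + np.linalg.norm(entry["video"] - video_stat))

    def retrieve(self, audio_stat, video_stat):
        """Return (index, copy of params) of the nearest entry, or (None, None)."""
        if not self.entries:
            return None, None
        dists = [self._distance(e, audio_stat, video_stat) for e in self.entries]
        idx = int(np.argmin(dists))
        return idx, self.entries[idx]["params"].copy()

    def store(self, audio_stat, video_stat, params, idx=None):
        """Save adapted params back: EMA-merge into entry idx, or append a new one."""
        if idx is not None:
            m, e = self.momentum, self.entries[idx]
            e["audio"] = m * e["audio"] + (1 - m) * audio_stat
            e["video"] = m * e["video"] + (1 - m) * video_stat
            e["params"] = m * e["params"] + (1 - m) * params
            return
        if len(self.entries) >= self.budget:
            self.entries.pop(0)  # assumed policy: evict the oldest entry
        self.entries.append({"audio": np.array(audio_stat, dtype=float),
                             "video": np.array(video_stat, dtype=float),
                             "params": np.array(params, dtype=float)})
```

In use, each incoming test batch would be summarized into `audio_stat` and `video_stat`, `retrieve` would supply the starting fusion parameters, the model would adapt them on the batch, and `store` would write the result back. Whether a batch updates an existing entry or creates a new one is left to the caller here; the abstract does not specify that rule.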
