Decoupling Stability and Plasticity for Multi-Modal Test-Time Adaptation

Yongbo He, Zirun Guo, Tao Jin

TL;DR

This paper proposes Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework that significantly outperforms state-of-the-art methods in multi-modal test-time adaptation.

Abstract

Adapting pretrained multi-modal models to evolving test-time distributions, known as multi-modal test-time adaptation, presents a significant challenge. Existing methods frequently encounter negative transfer in the unbiased modality and catastrophic forgetting in the biased modality. To address these challenges, we propose Decoupling Adaptation for Stability and Plasticity (DASP), a novel diagnose-then-mitigate framework. Our analysis reveals a critical discrepancy within the unified latent space: the biased modality exhibits substantially higher interdimensional redundancy (i.e., strong correlations across feature dimensions) compared to the unbiased modality. Leveraging this insight, DASP identifies the biased modality and implements an asymmetric adaptation strategy. This strategy employs a decoupled architecture where each modality-specific adapter is divided into stable and plastic components. The asymmetric mechanism works as follows: for the biased modality, which requires plasticity, the plastic component is activated and updated to capture domain-specific information, while the stable component remains fixed. Conversely, for the unbiased modality, which requires stability, the plastic component is bypassed, and the stable component is updated using KL regularization to prevent negative transfer. This asymmetric design enables the model to adapt flexibly to new domains while preserving generalizable knowledge. Comprehensive evaluations on diverse multi-modal benchmarks demonstrate that DASP significantly outperforms state-of-the-art methods.
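
The asymmetric mechanism described in the abstract can be read as a per-modality routing decision. The following is a minimal sketch of that decision only, not the authors' implementation. It assumes a modality is diagnosed as biased when its redundancy score exceeds a threshold `delta`; the names `AdapterPlan` and `plan_adaptation`, the example scores, and the threshold value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterPlan:
    """Which part of a modality-specific adapter is used and updated (hypothetical type)."""
    plastic_active: bool  # is the plastic component in the forward path?
    trainable: str        # "plastic" or "stable"
    regularizer: str      # "kl" or "none"


def plan_adaptation(redundancy, delta):
    """Map per-modality redundancy scores to an adaptation plan.

    Assumption: score > delta means the modality is diagnosed as biased.
    """
    plans = {}
    for modality, score in redundancy.items():
        if score > delta:
            # Biased modality: needs plasticity. The plastic component is
            # activated and updated; the stable component stays fixed.
            plans[modality] = AdapterPlan(True, "plastic", "none")
        else:
            # Unbiased modality: needs stability. The plastic component is
            # bypassed; the stable component is updated with KL regularization.
            plans[modality] = AdapterPlan(False, "stable", "kl")
    return plans


# Illustrative scores only; these are not values from the paper.
plans = plan_adaptation({"audio": 0.42, "video": 0.11}, delta=0.25)
```

In this sketch the diagnosis and the update rule are kept separate, so the threshold can be varied without touching the adapter logic.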

Paper Structure (16 sections, 13 equations, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Limitations in Multi-Modal TTA. We evaluate changes in source domain performance during continual adaptation, measured as $\Delta=\mathrm{Acc}_\text{original}-\mathrm{Acc}_\text{adapted}$, for state-of-the-art methods (READ and TSA). Results indicate ongoing degradation in both multi-modal and uni-modal contexts. Performance drops in the biased modality are referred to as catastrophic forgetting, while drops in the unbiased modality are considered negative transfer.
  • Figure 2: Entropy and confidence statistics on VGGSound-C with a corrupted audio modality. Since audio serves as the dominant modality in this dataset, it continues to display lower entropy and greater confidence, even in the presence of distribution shifts.
  • Figure 3: Redundancy statistics on Kinetics50-C and VGGSound-C. The corrupted modality demonstrates increased redundancy in feature embeddings. Furthermore, the results underscore a significant correlation between redundancy and accuracy.
  • Figure 4: Overview of the proposed DASP, a diagnose-then-mitigate framework. It begins by diagnosing the biased modality through redundancy scores, followed by asymmetric adaptation that includes modality-specific updates guided by entropy minimization.
  • Figure 5: Sensitivity Analysis of Hyper-parameters: Batch Size ($\mathbf{B}$), Redundancy Threshold ($\mathbf{\delta}$) and Loss Coefficients ($\lambda_{\text{ent}}$, $\lambda_{\text{kl}}$).
  • ...and 4 more figures

Theorems & Definitions (1)

  • Definition 1: Interdimensional Redundancy
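
The formal statement of Definition 1 is not reproduced on this page. As a rough illustration of the idea described in the abstract (strong correlations across feature dimensions), one plausible proxy is the mean absolute off-diagonal entry of the feature correlation matrix over a batch. The sketch below uses that proxy; the function name `redundancy_score` and the exact formula are assumptions and may differ from the paper's definition.

```python
import numpy as np


def redundancy_score(features):
    """Mean absolute off-diagonal Pearson correlation across feature dimensions.

    features: array of shape (batch, dim). This is an assumed proxy for
    interdimensional redundancy, not the paper's exact definition.
    Constant feature dimensions would produce NaN and are not handled here.
    """
    features = np.asarray(features, dtype=np.float64)
    corr = np.corrcoef(features, rowvar=False)      # (dim, dim), columns are variables
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))


# Synthetic check: independent dimensions vs. dimensions driven by one shared factor.
rng = np.random.default_rng(0)
independent = rng.normal(size=(256, 16))
shared = rng.normal(size=(256, 1))
correlated = shared + 0.1 * rng.normal(size=(256, 16))

low = redundancy_score(independent)
high = redundancy_score(correlated)
```

On this synthetic data the shared-factor features score far higher than the independent ones, which matches the qualitative behaviour the abstract attributes to the biased modality.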