Attention Bootstrapping for Multi-Modal Test-Time Adaptation

Yusheng Zhao, Junyu Luo, Xiao Luo, Jinsheng Huang, Jingyang Yuan, Zhiping Xiao, Ming Zhang

TL;DR

This work tackles multi-modal test-time adaptation under distribution shifts by addressing the attention gap between self- and cross-attention in fused modalities. It introduces ABPEM, which combines two components: attention bootstrapping, which uses self-attention as an anchor to improve cross-attention, and principal entropy minimization, which reduces gradient noise by focusing on the most reliable classes. The method minimizes a joint objective $\mathcal{L} = \lambda \mathcal{L}_{AB} + \mathcal{L}_{PEM}$, where $\lambda$ weights the attention bootstrapping term, and operates with the same computational footprint as the base model, enabling efficient online adaptation. Experiments on Kinetics50-C and VGGSound-C demonstrate consistent gains across corruption types, with ablations validating the importance of both components for robust, label-free multi-modal test-time adaptation. The approach offers a practical pathway for reliable multi-modal fusion under real-world distribution shifts.

Abstract

Test-time adaptation aims to adapt a well-trained model to potential distribution shifts at test time using only unlabeled test data, without access to the original training data. While previous efforts mainly focus on a single modality, test-time distribution shift in the multi-modal setting is more complex and calls for new solutions. This paper tackles the problem of multi-modal test-time adaptation by proposing a novel method named Attention Bootstrapping with Principal Entropy Minimization (ABPEM). We observe that test-time distribution shift causes misalignment across modalities, leading to a large gap between intra-modality discrepancies (measured by self-attention) and inter-modality discrepancies (measured by cross-attention). We name this the attention gap. This attention gap widens with more severe distribution shifts, hindering effective modality fusion. To mitigate this attention gap and encourage better modality fusion, we propose attention bootstrapping that promotes cross-attention with the guidance of self-attention. Moreover, to reduce the gradient noise in the commonly-used entropy minimization, we adopt principal entropy minimization, a refinement of entropy minimization that reduces gradient noise by focusing on the principal parts of entropy, excluding less reliable gradient information. Extensive experiments on the benchmarks validate the effectiveness of the proposed ABPEM in comparison with competing baselines.
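
The two quantities named in the abstract can be illustrated with a minimal, hedged sketch. It is not the authors' implementation. It assumes that the attention gap is the mean self-attention weight minus the mean cross-attention weight in a joint attention matrix over the concatenated tokens of two modalities, and that the "principal parts" of entropy are the k most probable classes, renormalized. The function names and the parameters `n_a` and `k` are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def principal_entropy(logits, k):
    """Entropy restricted to the k most probable classes (ASSUMED reading
    of "principal parts of entropy"; the paper may define it differently)."""
    probs = softmax(logits)
    top = sorted(probs, reverse=True)[:k]
    total = sum(top)
    # Renormalize so the kept classes form a distribution again.
    return entropy([p / total for p in top])


def attention_gap(attn, n_a):
    """Mean self-attention weight minus mean cross-attention weight.

    attn: square attention matrix (list of rows) over concatenated tokens,
          where the first n_a tokens belong to modality A and the rest to B.
    ASSUMPTION: the gap is a simple difference of means.
    """
    self_w, cross_w = [], []
    for i, row in enumerate(attn):
        for j, w in enumerate(row):
            same_modality = (i < n_a) == (j < n_a)
            (self_w if same_modality else cross_w).append(w)
    return sum(self_w) / len(self_w) - sum(cross_w) / len(cross_w)


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.1, -1.0]
    print("full entropy:     ", entropy(softmax(logits)))
    print("principal (k=2):  ", principal_entropy(logits, k=2))

    # Two tokens per modality; each token attends mostly within its modality.
    attn = [
        [0.4, 0.4, 0.1, 0.1],
        [0.4, 0.4, 0.1, 0.1],
        [0.1, 0.1, 0.4, 0.4],
        [0.1, 0.1, 0.4, 0.4],
    ]
    print("attention gap:    ", attention_gap(attn, n_a=2))
```

Under these assumptions, the adaptation step would minimize $\lambda \mathcal{L}_{AB} + \mathcal{L}_{PEM}$, with the first term built on a quantity like `attention_gap` and the second on `principal_entropy`. The exact loss definitions are given in the paper, not in this sketch.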

Paper Structure

This paper contains 15 sections, 13 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: During test time, the distribution shift typically has a larger impact on the inter-modality discrepancy than intra-modality discrepancy, leading to an increasing attention gap.
  • Figure 2: The framework of the proposed ABPEM.
  • Figure 3: As the test-time distribution shift becomes more severe, the attention gap (blue bar plot) tends to increase, and the prediction accuracy (orange line plot) tends to decrease.
  • Figure 4: The test-time mean error increases with the rank of the class. Classes with lower ranks are more robust to test-time distribution shift (lower errors).
  • Figure 5: Ablation of the main components of ABPEM.
  • ...and 2 more figures