A Layer Selection Approach to Test Time Adaptation

Sabyasachi Sahoo, Mostafa ElAraby, Jonas Ngnawe, Yann Pequignot, Frederic Precioso, Christian Gagne

TL;DR

The paper tackles distribution shift in Test Time Adaptation by showing that not all network layers respond equally to adaptation and that misaligned gradient updates can degrade performance. It introduces Gradient-Aligned Layer Adaptation (GALA), a cosine-distance–based layer selection criterion that ranks layers by gradient alignment and applies a binary mask to update only the most beneficial layer per sample, with a reset window to accommodate direction changes. Through extensive experiments on DomainBed and Continual TTA benchmarks, GALA consistently outperforms ERM and all-layers baselines across backbones and losses, and approaches or surpasses oracle layer strategies without requiring target labels. The results reveal that good layers vary with shift and loss, and that the reset mechanism further boosts performance in multi-domain settings, highlighting GALA’s practical potential as a robust, flexible plug-in for TTA systems.

Abstract

Test Time Adaptation (TTA) addresses the problem of distribution shift by adapting a pretrained model to a new domain during inference. When faced with challenging shifts, most methods collapse and perform worse than the original pretrained model. In this paper, we find that not all layers are equally receptive to the adaptation, and the layers with the most misaligned gradients often cause performance degradation. To address this, we propose GALA, a novel layer selection criterion to identify the most beneficial updates to perform during test time adaptation. This criterion can also filter out unreliable samples with noisy gradients. Its simplicity allows seamless integration with existing TTA loss functions, thereby preventing degradation and focusing adaptation on the most trainable layers. This approach also helps to regularize adaptation to preserve the pretrained features, which are crucial for handling unseen domains. Through extensive experiments, we demonstrate that the proposed layer selection framework improves the performance of existing TTA approaches across multiple datasets, domain shifts, model architectures, and TTA losses.
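The per-sample layer selection described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the alignment score (cosine between a layer's proposed update and its accumulated displacement including that update), the `threshold` of 0 for skipping a sample, the per-layer bookkeeping, and the names `select_layer` and `adapt` are all hypothetical choices made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b, eps=1e-12):
    # Cosine similarity between two vectors given as lists of floats.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def select_layer(displacements, updates, threshold=0.0):
    """Return the best-aligned layer name, or None to skip the sample.

    Assumed score: cosine between the proposed update and the total
    displacement that would result from applying it.
    """
    scores = {}
    for name, u in updates.items():
        total = [t + x for t, x in zip(displacements[name], u)]
        scores[name] = cosine(total, u)
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None

def adapt(params, update_stream, window=3, threshold=0.0):
    """Apply masked updates; `update_stream` yields {layer: update} per sample."""
    disp = {k: [0.0] * len(v) for k, v in params.items()}
    chosen = []
    for i, updates in enumerate(update_stream):
        if i % window == 0:
            # First sample of a reset window: re-anchor and adapt all layers.
            disp = {k: [0.0] * len(v) for k, v in params.items()}
            mask = {name: 1 for name in updates}
        else:
            best = select_layer(disp, updates, threshold)
            # Binary mask: 1 for the selected layer, 0 elsewhere (all 0 = skip).
            mask = {name: int(name == best) for name in updates}
        for name, m in mask.items():
            if m:
                params[name] = [p + x for p, x in zip(params[name], updates[name])]
                disp[name] = [d + x for d, x in zip(disp[name], updates[name])]
        chosen.append(sorted(n for n, m in mask.items() if m))
    return params, chosen

# Toy stream of per-layer updates (stand-ins for -lr * gradient).
stream = [
    {"block1": [1.0, 0.0], "block2": [0.0, 1.0]},    # window start: all layers
    {"block1": [0.5, 0.0], "block2": [0.0, -0.5]},   # block1 aligned, block2 opposed
    {"block1": [-0.5, 0.0], "block2": [0.0, -0.5]},  # both opposed: skip sample
    {"block1": [-1.0, 0.0], "block2": [0.0, -1.0]},  # new window: all layers again
]
params, chosen = adapt({"block1": [0.0, 0.0], "block2": [0.0, 0.0]}, stream)
```

In this toy run the fourth sample is applied to every layer even though it reverses direction, because the reset window re-anchors the displacement before the alignment check would have rejected it.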

Paper Structure (48 sections, 11 equations, 8 figures, 6 tables, 1 algorithm)

Figures (8)

  • Figure 1: Intuition for proposed approaches: (a) As the model gets closer to a minimum, individual sample gradients start to become misaligned with the gradients of previous samples (mahsereci2017early; forouzesh2021disparity; agarwal2022estimating). We leverage this misalignment to identify trainable layers. (b) While effective at moving in the direction of the most aligned gradients, the introduced criterion based on angular deviation could prevent adaptation when a direction change is needed, even if the following updates (or gradients) are aligned. Resetting the past horizon (i.e., the gradients of previous samples) considered in the alignment condition can help resolve such situations.
  • Figure 2: The Gradient-Aligned Layer Adaptation (GALA) framework. For the first sample in each reset window (e.g., $x_1, x_n, \dots$), it adapts all the layers. For every other sample, it adapts only the most gradient-aligned layer, and it can skip adaptation on a given sample entirely if all the layers are misaligned. The reset window periodically resets the anchor parameters to allow for a change in direction.
  • Figure 3: Illustration of the proposed criterion based on angular deviation. Different layers can be ranked based on their alignment with previous gradient updates. In the figure, updates drawn in red are discarded, while green updates are applied, adding up to $\mathbf{TD}_{i-1}$. The update under scrutiny $\mathbf{u}_i$ is drawn in cyan, and its sum with $\mathbf{TD}_{i-1}$ is drawn in blue. Whether update $\mathbf{u}_i$ is applied depends on the angle $\alpha_i$.
  • Figure 4: Heatmap of per-block performance improvement (%) on the DomainBed benchmark. Performance improvement is the difference between the TTA accuracy of a given block/layer and the ERM accuracy for the same shift. Positive performance improvements are shown in green, and negative performance improvements (or degradation) are in red. A bounding box highlights the best block per loss and dataset shift. Further details in Sec. \ref{sec:study}.
  • Figure 5: Effect of the magnitude of $u$ on the cosine distance criterion. Left: Consider two vectors such that $u_1$ is smaller than $u_2$ but is better aligned with its displacement. For large displacements ($T$), alignment becomes crucial and GALA selects $u_2$. For small displacements ($T'$), the update’s magnitude can dominate the criterion, and GALA selects $u_1$. Middle and Right: Plot of cosine metric values with level curves. Alignment prevails for small updates compared to the total displacement (Middle). But, for updates with large magnitude compared to total displacement (Right), large cosine values can be obtained even for misaligned updates.
  • ...and 3 more figures
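
The magnitude effect described in the Figure 5 caption can be checked numerically with a small sketch. It assumes, based on the Figure 3 caption, that the criterion is the cosine of the angle between an update $u$ and the resulting total displacement $T + u$; the exact definition in the paper may differ, and the vectors and the name `cosine_score` are made up for illustration.

```python
import math

def cosine_score(total_displacement, update, eps=1e-12):
    # Assumed criterion: cosine between `update` and `total_displacement + update`.
    new_total = [t + u for t, u in zip(total_displacement, update)]
    num = sum(a * b for a, b in zip(new_total, update))
    den = math.hypot(*new_total) * math.hypot(*update)
    return num / (den + eps)

# Hypothetical updates: one small and at 45 degrees to the displacement,
# one large and orthogonal to it.
u_small_aligned = [1.0, 1.0]
u_large_orthogonal = [0.0, 5.0]

T_large = [100.0, 0.0]  # displacement much larger than either update
T_small = [0.1, 0.0]    # displacement much smaller than either update

# With a large displacement, the better-aligned update scores higher.
large_T_scores = (cosine_score(T_large, u_small_aligned),
                  cosine_score(T_large, u_large_orthogonal))

# With a small displacement, T + u is dominated by u itself, so the
# larger update scores close to 1 despite being orthogonal to T.
small_T_scores = (cosine_score(T_small, u_small_aligned),
                  cosine_score(T_small, u_large_orthogonal))
```

Under this assumed definition, the ranking of the two updates flips between the two displacement regimes, which is the behaviour the level-curve plots in the caption describe.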