MoETTA: Test-Time Adaptation Under Mixed Distribution Shifts with MoE-LayerNorm

Xiao Fan, Jingyan Jiang, Zhaoru Chen, Fanding Huang, Xiao Chen, Qinting Jiang, Bowen Zhang, Xing Tang, Zhi Wang

TL;DR

MoETTA tackles test-time adaptation under mixed distribution shifts by introducing a Mixture-of-Experts LayerNorm (MoE-LayerNorm) that enables multiple, distinct adaptation directions within a single model. By routing each test sample to a single expert and combining it with a shared expert, MoETTA captures diverse gradient directions while maintaining efficiency, aided by a load-balancing loss and entropy-based sample selection. The model demonstrates state-of-the-art robustness on existing mixed-shift benchmarks and the newly proposed potpourri and potpourri+ settings, while offering insights into expert diversity and scalability. This approach enhances practical deployment by accommodating heterogeneous test streams and mitigating forgetting, with modest computational overhead and broad applicability to Vision Transformers.

Abstract

Test-time adaptation (TTA) has proven effective in mitigating performance drops under single-domain distribution shifts by updating model parameters during inference. However, real-world deployments often involve mixed distribution shifts, where test samples are affected by diverse and potentially conflicting domain factors, posing significant challenges even for SOTA TTA methods. A key limitation of existing approaches is their reliance on a unified adaptation path, which fails to account for the fact that optimal gradient directions can vary significantly across domains. Moreover, current benchmarks focus only on synthetic or homogeneous shifts, failing to capture the complexity of real-world heterogeneous mixed distribution shifts. To address this, we propose MoETTA, a novel entropy-based TTA framework that integrates the Mixture-of-Experts (MoE) architecture. Rather than enforcing a single parameter update rule for all test samples, MoETTA introduces a set of structurally decoupled experts, enabling adaptation along diverse gradient directions. This design allows the model to better accommodate heterogeneous shifts through flexible and disentangled parameter updates. To simulate realistic deployment conditions, we introduce two new benchmarks: potpourri and potpourri+. While classical settings focus solely on synthetic corruptions, potpourri encompasses a broader range of domain shifts (natural, artistic, and adversarial distortions), capturing more realistic deployment challenges. In addition, potpourri+ includes source-domain samples to evaluate robustness against catastrophic forgetting. Extensive experiments across three mixed-distribution-shift settings show that MoETTA consistently outperforms strong baselines, establishing SOTA performance and highlighting the benefit of modeling multiple adaptation directions via expert-level diversity.

Paper Structure

This paper contains 33 sections, 2 theorems, 32 equations, 9 figures, and 12 tables.

Key Result

Proposition 1

Let $\theta_1,\theta_2,\theta\in\mathbb{R}^d$ be independent random vectors sampled as [...] and define [...]. Then the expected cosine similarity satisfies [...]. Moreover, this expectation is independent of $\sigma$.

Figures (9)

  • Figure 1: Cosine similarity heatmap of accumulated gradient directions $\theta_i - \theta_{\text{pre}}$, where $\theta_i$ denotes the model parameters adapted on the $i$-th domain of ImageNet-C using Tent, and $\theta_{\text{pre}}$ denotes the pre-trained model parameters. Each entry at position $(i, j)$ represents the cosine similarity between the adaptation directions from domains $i$ and $j$. The average cosine similarity in the lower triangle is 0.69, suggesting substantial variation across domains and highlighting the limitation of using a single adaptation direction under mixed distribution shifts.
  • Figure 1: Overview of the corruption types and benchmark compositions used in our evaluation. The left block shows the 15 corruption types from ImageNet-C, grouped into four categories: noise, blur, weather, and digital. While classical mixed distribution shifts rely solely on these corruptions, our proposed Potpourri benchmark extends the evaluation to include samples from ImageNet-R (renditions), ImageNet-Sketch (sketches), and ImageNet-A (adversarial hard samples), thereby increasing semantic and stylistic diversity. Potpourri+ further includes clean validation images from the original ImageNet dataset to simulate real-world data streams that occasionally contain in-distribution (ID) samples.
  • Figure 2: Method Overview. We replace the LayerNorm modules in the encoder blocks of a Vision Transformer (ViT) with our proposed MoE-LayerNorm. Colors of the embeddings and their routed components are used illustratively to suggest that the samples originate from different domains. For each input embedding, we first compute the mean across the token (sequence) dimension. This averaged vector is fed into a router to obtain routing probabilities, and the expert with the highest probability is selected. Its parameters are then added to the frozen pre-trained LayerNorm parameters to form a sample-specific LayerNorm. Finally, each token embedding is normalized using this customized LayerNorm. PyTorch-style pseudocode of the forward pass for MoE-LayerNorm can be found in the paper's appendix.
  • Figure 2: Cosine similarity between the expert weights within each MoE-LayerNorm layer after adaptation under the potpourri+ setting with corruption level 5. Each heatmap corresponds to one MoE-LayerNorm layer.
  • Figure 3: t-SNE projection of the CLS token embeddings $\bm{z}^{0}_{L}$ from ViT-B/16 on 7,500 samples per dataset from ImageNet-A, -R, -Sketch, and -C. While classical mixed distribution shifts (ImageNet-C only) occupy a relatively narrow region (gray), our proposed potpourri benchmark introduces greater semantic and stylistic diversity by incorporating three additional variants. This results in a more heterogeneous and realistic evaluation setting for TTA.
  • ...and 4 more figures
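
The MoE-LayerNorm forward pass described in the Method Overview caption can be sketched as follows. This is a minimal illustration written in NumPy rather than PyTorch, not the authors' implementation. The function names (`layer_norm`, `moe_layer_norm`), the linear router `router_w`, the tensor shapes, and the zero-initialized expert parameters are all assumptions made for the example. The shared expert, the load-balancing loss, and any gradient path to the router are omitted.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-6):
    """Standard LayerNorm over the last (feature) dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps) * gamma + beta


def moe_layer_norm(x, router_w, expert_gamma, expert_beta, base_gamma, base_beta):
    """Top-1 routed LayerNorm (illustrative sketch).

    x:            (batch, tokens, dim) token embeddings
    router_w:     (dim, n_experts) hypothetical linear router weights
    expert_gamma: (n_experts, dim) per-expert offsets to the scale
    expert_beta:  (n_experts, dim) per-expert offsets to the shift
    base_gamma, base_beta: (dim,) frozen pre-trained LayerNorm parameters
    """
    # 1. Average each sample's embedding over the token (sequence) dimension.
    pooled = x.mean(axis=1)                                  # (batch, dim)

    # 2. Router: numerically stable softmax over expert logits.
    logits = pooled @ router_w                               # (batch, n_experts)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=-1, keepdims=True)

    # 3. Select the expert with the highest routing probability.
    top1 = probs.argmax(axis=-1)                             # (batch,)

    # 4. Add the selected expert's parameters to the frozen parameters.
    gamma = base_gamma + expert_gamma[top1]                  # (batch, dim)
    beta = base_beta + expert_beta[top1]                     # (batch, dim)

    # 5. Normalize every token with the sample-specific LayerNorm.
    out = layer_norm(x, gamma[:, None, :], beta[:, None, :])
    return out, top1


# Small usage example with random inputs (all sizes are arbitrary).
rng = np.random.default_rng(0)
batch, tokens, dim, n_experts = 4, 5, 8, 3
x = rng.normal(size=(batch, tokens, dim))
router_w = rng.normal(size=(dim, n_experts))
base_gamma, base_beta = np.ones(dim), np.zeros(dim)

# Assumption: experts start at zero, so the layer initially matches
# the pre-trained LayerNorm exactly.
expert_gamma = np.zeros((n_experts, dim))
expert_beta = np.zeros((n_experts, dim))

out, top1 = moe_layer_norm(
    x, router_w, expert_gamma, expert_beta, base_gamma, base_beta
)
```

Under the zero-initialization assumed here, the output equals that of the plain pre-trained LayerNorm until adaptation moves the expert parameters. Samples routed to different experts then receive different scale and shift offsets, which is how separate adaptation directions can coexist in one layer.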

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Proposition 2
  • proof