Decomposing the Neurons: Activation Sparsity via Mixture of Experts for Continual Test Time Adaptation

Rongyu Zhang; Aosong Cheng; Yulin Luo; Gaole Dai; Huanrui Yang; Jiaming Liu; Ran Xu; Li Du; Yuan Du; Yanbing Jiang; Shanghang Zhang

TL;DR

This work tackles continual test-time adaptation (CTTA) by introducing MoASE, a Mixture-of-Activation-Sparsity-Experts adapter that explicitly decomposes neural activations into high-activation (domain-agnostic shape cues) and low-activation (domain-specific texture cues) components using Spatial Differentiate Dropout. A multi-gate routing mechanism, comprising Domain-Aware Gate (DAG) and Activation-Sparsity Gate (ASG), dynamically allocates and thresholds activations across a set of experts, enabling robust adaptation to evolving target domains. The approach is regularized by a Homeostatic-Proximal loss within a teacher-student framework to mitigate error accumulation, and it is validated on classification and segmentation CTTA benchmarks where it achieves state-of-the-art results. The work demonstrates better cross-domain representation with reduced inter-domain divergence and improved intra-class cohesion, offering practical benefits for real-world continual adaptation under resource constraints.

Abstract

Continual Test-Time Adaptation (CTTA), which aims to adapt a pre-trained model to ever-evolving target domains, has emerged as an important task for vision models. Because current vision models appear to be heavily biased towards texture, continuously adapting a model from one domain distribution to another can result in serious catastrophic forgetting. Drawing inspiration from the human visual system's adeptness at processing both shape and texture according to the famous Trichromatic Theory, we explore the integration of a Mixture-of-Activation-Sparsity-Experts (MoASE) as an adapter for the CTTA task. Given the distinct reaction of neurons with low/high activation to domain-specific/agnostic features, MoASE decomposes the neural activation into high-activation and low-activation components with a non-differentiable Spatial Differentiate Dropout (SDD). Based on the decomposition, we devise a multi-gate structure comprising a Domain-Aware Gate (DAG), which uses domain information to adaptively combine experts that process the post-SDD sparse activations of different strengths, and an Activation Sparsity Gate (ASG), which adaptively assigns the SDD feature-selection threshold for each expert for more precise feature decomposition. Finally, we introduce a Homeostatic-Proximal (HP) loss to bypass the error accumulation problem when continuously adapting the model. Extensive experiments on four prominent benchmarks substantiate that our methodology achieves state-of-the-art performance in both classification and segmentation CTTA tasks. Our code is now available at https://github.com/RoyZry98/MoASE-Pytorch.
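To make the decomposition idea concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that the split is a hard magnitude threshold, that each expert is an identity map over its sparse view, and that the gate is a softmax over given scores. The names `split_by_threshold`, `softmax`, `mix_experts`, `gate_scores`, `thresholds`, and `keep_high` are hypothetical and do not come from the paper or its repository.

```python
import math


def split_by_threshold(activations, threshold):
    """Split activations into (high, low) views by magnitude.

    Entries that do not belong to a view are zeroed, so high + low
    reproduces the input element by element. This hard mask is an
    assumption standing in for the paper's SDD.
    """
    high = [a if abs(a) >= threshold else 0.0 for a in activations]
    low = [a if abs(a) < threshold else 0.0 for a in activations]
    return high, low


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(activations, gate_scores, thresholds, keep_high):
    """Combine per-expert sparse views using softmax gate weights.

    gate_scores: one score per expert (assumed to come from a domain gate).
    thresholds:  one split threshold per expert (assumed to come from a
                 sparsity gate).
    keep_high:   per expert, True to keep the high view, False for the low.
    Experts are identity maps here purely for illustration.
    """
    weights = softmax(gate_scores)
    out = [0.0] * len(activations)
    for w, t, kh in zip(weights, thresholds, keep_high):
        high, low = split_by_threshold(activations, t)
        view = high if kh else low
        out = [o + w * v for o, v in zip(out, view)]
    return out


if __name__ == "__main__":
    acts = [0.1, -2.0, 0.5, 3.0]
    print(split_by_threshold(acts, 1.0))
    print(mix_experts(acts, [0.0, 0.0], [1.0, 1.0], [True, False]))
```

In this sketch the two views partition the activations, so with equal gate scores and one expert per view the mixed output is simply half of the input. A real adapter would apply learned expert layers to each view before mixing.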

Paper Structure (21 sections, 1 theorem, 11 equations, 6 figures, 11 tables)

Introduction
Related works
Motivation
Methods
Preliminary
Mixture-of-Activation-Sparsity-Experts
Optimization objective
Justification
Experiments
Quantitative analysis
Ablation study
Qualitative analysis
Conclusion and limitations
A bound relating the source and target error
Detailed experiment settings
...and 6 more sections

Key Result

Theorem 1

For a hypothesis h,

Figures (6)

Figure 1: The problem and motivation. Our goal is to effectively adapt the source pre-trained model to continually changing target domains. We propose a Mixture-of-Experts (MoE)-based adapter that encodes the different features of texture and shape with different experts. Our design is inspired by the photoreceptor cells in the human visual system, where the three types of cone cells are sensitive to different wavelengths of light and the Fovea.
Figure 2: The visualization analysis of the Class Activation Map (CAM). We adopt CAM to compare the attention of the low-activation MoASE, high-activation MoASE, and the original model during the continual adaptation process.
Figure 3: The overall framework of Mixture-of-Activation-Sparsity-Experts (MoASE). (Left) We integrate the MoASE into the linear layers of a pre-trained source model with a teacher-student framework, using consistency loss and a specially formulated Homeostatic-Proximal (HP) loss as the optimization target. (Right) Depending on the degree of distribution shift, we devise a multi-router structure that includes the Domain-Aware Gate (DAG) and the Activation Sparsity Gate (ASG).
Figure 4: Inter-domain and intra-class distance. The X-axis displays the 15 corruption domains in CIFAR10-C, listed in sequential order. (a) MoASE more effectively reduces inter-domain divergence than the source model across all 14 domain shifts. (b) MoASE significantly improves intra-class feature aggregation, producing results that closely align with those of our proposed method.
Figure 5: The qualitative analysis of the CAM and the segmentation qualitative comparison of our method with previous SOTA methods on the ACDC dataset.
...and 1 more figures

Theorems & Definitions (2)

Theorem 1
Proof 1
