One Pool Is Not Enough: Multi-Cluster Memory for Practical Test-Time Adaptation

Yu-Wen Tseng; Xingyi Zheng; Ya-Chen Wu; I-Bin Liao; Yung-Hui Li; Hong-Han Shuai; Wen-Huang Cheng

Abstract

Test-time adaptation (TTA) adapts pre-trained models to distribution shifts at inference using only unlabeled test data. Under the Practical TTA (PTTA) setting, where test streams are temporally correlated and non-i.i.d., memory has become an indispensable component for stable adaptation, yet existing methods universally store samples in a single unstructured pool. We show that this single-cluster design is fundamentally mismatched to PTTA: a stream clusterability analysis reveals that test streams are inherently multi-modal, with the optimal number of mixture components consistently far exceeding one. To close this structural gap, we propose Multi-Cluster Memory (MCM), a plug-and-play framework that organizes stored samples into multiple clusters using lightweight pixel-level statistical descriptors. MCM introduces three complementary mechanisms: descriptor-based cluster assignment to capture distinct distributional modes, Adjacent Cluster Consolidation (ACC) to bound memory usage by merging the most similar temporally adjacent clusters, and Uniform Cluster Retrieval (UCR) to ensure balanced supervision across all modes during adaptation. Integrated with three contemporary TTA methods on CIFAR-10-C, CIFAR-100-C, ImageNet-C, and DomainNet, MCM achieves consistent improvements across all 12 configurations, with gains up to 5.00% on ImageNet-C and 12.13% on DomainNet. Notably, these gains scale with distributional complexity: larger label spaces with greater multi-modality benefit most from multi-cluster organization. GMM-based memory diagnostics further confirm that MCM maintains near-optimal distributional balance, entropy, and mode coverage, whereas single-cluster memory exhibits persistent imbalance and progressive mode loss. These results establish memory organization as a key design axis for practical test-time adaptation.
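The three mechanisms named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation: the abstract does not specify the descriptor, the distance, or the eviction rule, so the (mean, std) descriptor, the Euclidean distance, the `new_cluster_dist` threshold, the FIFO eviction, and all class and function names below are assumptions chosen for illustration only.

```python
import math
import random
from dataclasses import dataclass, field


def descriptor(pixels):
    """Hypothetical pixel-level descriptor: (mean, std) of a flat pixel list."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    return (mean, math.sqrt(var))


@dataclass
class Cluster:
    centroid: tuple
    samples: list = field(default_factory=list)  # (sample, descriptor) pairs


class MultiClusterMemory:
    def __init__(self, k_max=4, per_cluster_capacity=64, new_cluster_dist=0.1):
        self.k_max = k_max
        self.cap = per_cluster_capacity
        self.tau = new_cluster_dist  # assumed threshold for opening a cluster
        self.clusters = []           # kept in creation (temporal) order

    @staticmethod
    def _mean(descs):
        return tuple(sum(col) / len(descs) for col in zip(*descs))

    def add(self, sample, desc):
        # Assignment: nearest centroid, or a new cluster if none is close enough.
        best = min(self.clusters,
                   key=lambda c: math.dist(c.centroid, desc), default=None)
        if best is None or math.dist(best.centroid, desc) > self.tau:
            best = Cluster(centroid=tuple(desc))
            self.clusters.append(best)
        best.samples.append((sample, desc))
        if len(best.samples) > self.cap:
            best.samples.pop(0)  # assumed FIFO eviction inside a cluster
        best.centroid = self._mean([d for _, d in best.samples])
        if len(self.clusters) > self.k_max:
            self._consolidate()

    def _consolidate(self):
        # ACC: merge the closest pair among temporally adjacent clusters only.
        i = min(range(len(self.clusters) - 1),
                key=lambda j: math.dist(self.clusters[j].centroid,
                                        self.clusters[j + 1].centroid))
        a, b = self.clusters[i], self.clusters[i + 1]
        merged = (a.samples + b.samples)[-self.cap:]  # keep the newest entries
        self.clusters[i:i + 2] = [
            Cluster(self._mean([d for _, d in merged]), merged)
        ]

    def retrieve(self, batch_size, rng=random):
        # UCR: split the batch as evenly as possible over non-empty clusters.
        active = [c for c in self.clusters if c.samples]
        if not active:
            return []
        base, extra = divmod(batch_size, len(active))
        batch = []
        for idx, c in enumerate(active):
            quota = base + (1 if idx < extra else 0)
            # Sampling with replacement is an assumption of this sketch.
            batch.extend(rng.choice(c.samples)[0] for _ in range(quota))
        return batch
```

In this sketch the total capacity is bounded by `k_max * per_cluster_capacity`, and restricting merges to neighbouring clusters in creation order is one simple reading of "temporally adjacent" for a temporally correlated stream.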

Paper Structure (38 sections, 6 equations, 6 figures, 9 tables)

Introduction
Related Work
Methodology
Revisiting Memory-based Test-Time Adaptation
Test-time Adaptation System with Multi-Cluster Memory
Adjacent Cluster Consolidation (ACC)
Uniform Cluster Retrieval (UCR)
Experiments
Setup and Protocols
Datasets and Metrics.
Implementation Details.
Baselines.
Main Results
Consistent improvements across baselines.
Scaling with distributional complexity.
...and 23 more sections

Figures (6)

Figure 1: Motivation for multi-cluster memory. (a) Stream clusterability analysis on CIFAR-100-C (PTTA): we fit GMMs with varying $K$ to sliding windows of the test stream and select the optimal $K^*$ via BIC across three descriptor types. The consistently high $K^*$ values ($\mu_{K^*}$ = 5.9--9.7) confirm that the target distribution is inherently multi-modal, far exceeding the $K\!=\!1$ assumption of single-cluster memory. (b) Under the same total capacity, SCM samples concentrate around similar regions of the descriptor space, whereas MCM distributes samples across distinct modes, extending coverage to under-represented regions (highlighted in red circle).
Figure 2: Overview of the TTA system with Multi-Cluster Memory (MCM). Incoming samples are assigned to clusters via pixel-level descriptors (left). Uniform Cluster Retrieval (UCR) draws balanced samples across all clusters for adaptation (center). Adjacent Cluster Consolidation (ACC) merges the closest temporally adjacent pair when capacity is reached (right). The three stages jointly preserve the multi-modal structure of the target stream under bounded memory.
Figure 3: Memory scaling comparison on CIFAR-100-C (PTTA). Bars denote error rate; lines denote runtime. For MCM, per-cluster capacity is fixed at 64 and total capacity is varied by the number of clusters. Across all three baselines, simply enlarging the single-cluster pool increases runtime with negligible accuracy gain, whereas MCM consistently achieves lower error at lower cost under equal total capacity. PeTTA with SCM at 256 and 320 samples encountered out-of-memory errors (middle panel).
Figure 4: Diagnostic comparison of memory quality between SCM and MCM over the CIFAR-100-C stream (PTTA, PeTTA). We fit a GMM to the evolving stream and measure three properties of the stored memory: (a) imbalance ratio (lower is better), (b) distributional entropy (higher is more uniform), and (c) cluster coverage (fraction of GMM components with $>$1% representation). MCM maintains near-constant balance, entropy, and coverage throughout adaptation, whereas SCM exhibits high variance and progressive degradation.
Figure S1: Sensitivity of $K_{\max}$ on CIFAR-100-C (PTTA, severity 5) with PeTTA+MCM. Left: average error rate (%) vs. $K_{\max}$, showing a U-shaped trend with minimum at $K_{\max}{=}4$ (dashed red line). Right: per-corruption error heatmap (green = lower error). Most corruptions favor moderate $K_{\max}$ (3--5), while contrast and impulse noise are most sensitive to this parameter.
...and 1 more figures
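The three memory diagnostics named in the Figure 4 caption can be sketched as follows, assuming each stored sample has already been assigned to a GMM component (the fitting step is omitted). Only the coverage rule (components with more than 1% representation) comes from the caption. The imbalance ratio is taken here to be the largest component share divided by the smallest, and the entropy to be Shannon entropy in nats; both definitions, and the function name `memory_diagnostics`, are assumptions of this sketch.

```python
import math
from collections import Counter


def memory_diagnostics(component_ids, n_components, min_share=0.01):
    """Return (imbalance, entropy, coverage) for a non-empty memory.

    component_ids: GMM component index of each stored sample.
    """
    counts = Counter(component_ids)
    total = len(component_ids)
    shares = [counts.get(k, 0) / total for k in range(n_components)]

    # Assumed definition: max share / min share; infinite if a mode is absent.
    imbalance = max(shares) / min(shares) if min(shares) > 0 else math.inf
    # Shannon entropy of the component shares (higher means more uniform).
    entropy = -sum(s * math.log(s) for s in shares if s > 0)
    # Fraction of components holding more than min_share of the memory.
    coverage = sum(s > min_share for s in shares) / n_components
    return imbalance, entropy, coverage
```

A perfectly balanced memory over four components gives an imbalance of 1, entropy of $\ln 4$, and coverage of 1, while a memory that has lost two of the four modes gives infinite imbalance and coverage of 0.5 under these definitions.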
