
StableTTA: Training-Free Test-Time Adaptation that Improves Model Accuracy on ImageNet1K to 96%

Zheng Li, Jerry Cheng, Huanying Helen Gu

Abstract

Ensemble methods are widely used to improve predictive performance, but their effectiveness often comes at the cost of increased memory usage and computational complexity. In this paper, we identify a conflict in aggregation strategies that negatively impacts prediction stability. We propose StableTTA, a training-free method that improves aggregation stability and efficiency. Empirical results on ImageNet-1K show gains of 10.93–32.82% in top-1 accuracy, with 33 models achieving over 95% accuracy and several surpassing 96%. Notably, StableTTA allows lightweight architectures to outperform ViT by 11.75% in top-1 accuracy while using less than 5% of the parameters and reducing computational cost by approximately 89.1% (in GFLOPs), enabling high-accuracy inference on resource-constrained devices.


Paper Structure

This paper contains 23 sections, 22 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Top: Milestone comparison. We show that StableTTA+MobileNetV3 significantly outperforms the base ViT in terms of accuracy (+11.75% top-1), memory usage (-97% parameters), and computational cost (-89.1% GFLOPs). Bottom: General comparison. StableTTA improves baseline models by 11-33% in top-1 accuracy, with 34 models achieving more than 95% accuracy. The proposed method yields consistent and significant improvements across all evaluated models.
  • Figure 2: Limitations of Ensemble Methods and Conflicts in Aggregation Strategies. (a) Limitations: Illustration of model averaging and test-time augmentation (TTA) in image classification. Both methods significantly increase inference-time computational cost while providing only marginal improvements in accuracy. (b) Conflict: Given branch logits $\{\boldsymbol{z}^{(i)} \mid i=1, 2, 3\}$, probabilities $\boldsymbol{p}^{(i)}=\operatorname{softmax}(\boldsymbol{z}^{(i)})$, and predictions $\hat{y}^{(i)}=\operatorname{argmax} \boldsymbol{p}^{(i)}$, different aggregation strategies can produce inconsistent results. In this example, logit averaging predicts class $\hat{y}_{\text{logit}}=1$, soft voting predicts class $\hat{y}_{\text{soft}}=2$, and hard voting predicts class $\hat{y}_{\text{hard}}=3$. (c) Explanation: When the logits $({\boldsymbol{z}^{(1)}, \boldsymbol{z}^{(2)}, \dots})$ are sparsely distributed in space, the conflict $\hat{y}_{\text{logit}} \neq \hat{y}_{\text{soft}} \neq \hat{y}_{\text{hard}}$ is more likely to occur due to the nonlinearity and non-bijective nature of the softmax function. In contrast, when the logits are more densely clustered, such conflicts are less likely.
  • Figure 3: Efficiency–Accuracy Trade-offs. Comparison of baseline (blue) and StableTTA (red) across parameters (left), peak GFLOPs in series mode (middle), and total GFLOPs in parallel mode (right). Series mode processes images sequentially, while parallel mode processes augmented inputs simultaneously. StableTTA improves accuracy while maintaining favorable efficiency, enabling lightweight models to outperform larger ones.
  • Figure 4: (a) TTA (with our augmentation) vs. StableTTA. (b) StableTTA is robust to $K$ and mainly affected by $N$, but disabling logit processing $(K=C)$ significantly reduces top-1 accuracy.
  • Figure 5: Monte Carlo simulation. The conflict probability increases as $\operatorname{Var}(\boldsymbol{z})$ grows. In this simulation, we consider distributions: $\{ \boldsymbol{z} \sim \mathcal{N} (\boldsymbol{\mu}, \sigma I) \mid \mu \in \{(1, 0.9), (1, 0.7), (1, 0.5)\}, \sigma \in [0.05, 0.25] \}$. The solid curves show Monte Carlo estimates of the relationship between $\sigma$ and $\operatorname{P}(\hat{y}_{\text{logit}} \neq \hat{y}_{\text{hard}})$, while the dashed curves correspond to the theoretical (asymptotic) predictions. The empirical and theoretical results are closely matched.
  • ...and 1 more figure
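The aggregation conflict illustrated in Figure 2(b) is easy to reproduce. The sketch below uses our own illustrative numbers (not taken from the paper) to construct three branch probability vectors for which logit averaging, soft voting, and hard voting each select a different class; the logits are taken as $\log \boldsymbol{p}^{(i)}$, which is one valid logit vector for each branch:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax.
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical branch probabilities chosen so the three rules disagree:
# branch 1 is confident in class 1; branches 2 and 3 weakly prefer class 2.
probs = np.array([
    [0.25, 0.70, 0.05],
    [0.35, 0.20, 0.45],
    [0.35, 0.20, 0.45],
])
logits = np.log(probs)  # one valid logit vector per branch

y_logit = int(np.argmax(logits.mean(axis=0)))           # logit averaging
y_soft  = int(np.argmax(softmax(logits).mean(axis=0)))  # soft voting
votes   = np.argmax(logits, axis=1)                     # per-branch predictions
y_hard  = int(np.bincount(votes).argmax())              # hard (majority) voting

print(y_logit, y_soft, y_hard)  # → 0 1 2: three different classes
```

The disagreement arises exactly as the caption explains: averaging logits rewards the class with consistent support across branches (here class 0), soft voting rewards the class with one large probability spike (class 1), and hard voting rewards the class that wins the most branches (class 2).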
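Figure 5's claim that conflicts become more likely as the variance of the logits grows can be checked with a small Monte Carlo experiment. The sketch below is our own construction (the branch count and trial count are arbitrary choices, not values from the paper): it samples two-class branch logits $\boldsymbol{z}^{(i)} \sim \mathcal{N}(\boldsymbol{\mu}, \sigma I)$ and estimates $\operatorname{P}(\hat{y}_{\text{logit}} \neq \hat{y}_{\text{hard}})$ for increasing $\sigma$:

```python
import numpy as np

rng = np.random.default_rng(0)

def conflict_rate(mu, sigma, n_branches=5, trials=20_000):
    """Estimate P(y_logit != y_hard) for branch logits z ~ N(mu, sigma*I)."""
    z = rng.normal(mu, sigma, size=(trials, n_branches, len(mu)))
    y_logit = z.mean(axis=1).argmax(axis=1)      # logit averaging
    votes = z.argmax(axis=2)                     # per-branch argmax in {0, 1}
    # Majority vote over an odd number of branches (no ties possible).
    y_hard = (votes.mean(axis=1) > 0.5).astype(int)
    return float((y_logit != y_hard).mean())

mu = np.array([1.0, 0.9])  # closely spaced means, as in the figure
for sigma in (0.05, 0.15, 0.25):
    print(sigma, conflict_rate(mu, sigma))
```

Under these assumptions the estimated conflict rate grows with $\sigma$, consistent with the trend the caption describes; bringing the means closer together (e.g. $\boldsymbol{\mu} = (1, 0.9)$ rather than $(1, 0.5)$) raises the rate further.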