IT$^3$: Idempotent Test-Time Training

Nikita Durasov, Assaf Shocher, Doruk Oner, Gal Chechik, Alexei A. Efros, Pascal Fua

TL;DR

IT$^3$ introduces Idempotent Test-Time Training to adapt to distribution shifts on-the-fly using only the current test input. By enforcing an idempotence-based objective and using a frozen (or EMA) anchor during test-time updates, the method replaces domain-specific auxiliary tasks with a universal regularizer that pulls OOD representations toward the training distribution. The approach yields consistent improvements across diverse tasks (image classification, segmentation, age prediction, aerodynamics) and architectures (MLPs, CNNs, GNNs), including large-scale ImageNet-C, while maintaining practical inference costs. The results reveal a strong link between idempotence and prediction confidence, suggesting idempotence as a general principle for robust test-time adaptation with broad real-world impact.

Abstract

Deep learning models often struggle when deployed in real-world settings due to distribution shifts between training and test data. While existing approaches like domain adaptation and test-time training (TTT) offer partial solutions, they typically require additional data or domain-specific auxiliary tasks. We present Idempotent Test-Time Training (IT$^3$), a novel approach that enables on-the-fly adaptation to distribution shifts using only the current test instance, without any auxiliary task design. Our key insight is that enforcing idempotence -- where repeated applications of a function yield the same result -- can effectively replace domain-specific auxiliary tasks used in previous TTT methods. We theoretically connect idempotence to prediction confidence and demonstrate that minimizing the distance between successive applications of our model during inference leads to improved out-of-distribution performance. Extensive experiments across diverse domains (including image classification, aerodynamics prediction, and aerial segmentation) and architectures (MLPs, CNNs, GNNs) show that IT$^3$ consistently outperforms existing approaches while being simpler and more widely applicable. Our results suggest that idempotence provides a universal principle for test-time adaptation that generalizes across domains and architectures.
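For reference, the notion of idempotence used in the abstract can be written out as below. The two-argument form $f_\theta(\mathbf{x}, \cdot)$ and the symbols $y_1$, $y_2$ follow the figure captions; the neutral placeholder $\mathbf{0}$ for "no label given" is an assumption of this note, and the block is a sketch rather than the paper's exact objective.

$$
g(g(z)) = g(z) \quad \text{for all } z \qquad \text{(idempotence of a map } g\text{)}
$$

$$
y_1 = f_\theta(\mathbf{x}, \mathbf{0}), \qquad
y_2 = f_\theta(\mathbf{x}, y_1), \qquad
\text{idempotence error} = \lVert y_1 - y_2 \rVert
$$

Here $g$ plays the role of $f_\theta(\mathbf{x}, \cdot)$ for a fixed input $\mathbf{x}$: the error is zero exactly when feeding the model its own prediction leaves that prediction unchanged.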

Paper Structure

This paper contains 22 sections, 24 equations, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Idempotent Test-Time Training (IT$^3$) approach. During training (left), the model $f_{\theta}$ is trained to predict the label $y$ both with and without $y$ given to it as input. At test time (right), the model is applied sequentially to a corrupted input. It is then briefly trained, using only the current test input, with the objective of making $f_{\theta}(\mathbf{x},\cdot)$ idempotent.
  • Figure 2: Idempotence vs. Out-of-Distributionness: We plot the distribution of idempotence errors, measured by the distance $| y_1 - y_2 |$ in Eq. \ref{eq:zigzag}, for training, test, and OOD data. For OOD samples, we show the errors both before and after minimizing them globally. OOD samples exhibit significantly larger idempotence errors, which decrease after optimization. Figuratively, IT$^3$ pushes the OOD representations to be more similar to those of the training distribution. In Sec. \ref{sec:experiments}, we show that this reduction yields improved performance.
  • Figure 3: UCI Results on OOD inputs: The plots illustrate the performance of IT$^3$ compared to other baselines across different OOD levels. The box plot for tabular data shows the distribution of MAE at various OOD levels, where IT$^3$ with different batch sizes (batch=1, batch=4, batch=8) degrades less than the "Not optimized" baseline and ActMAD. Larger batch sizes preserve performance more effectively.
  • Figure 4: Test error (%) on CIFAR-10-C with level 5 corruptions. We compare our approach, IT$^3$, with object recognition without self-supervision, TTT, and ActMAD. IT$^3$ improves over the other baselines, and a larger batch size improves results further.
  • Figure 5: Face Samples. The (top) row shows training images of middle-aged individuals, while (middle) and (bottom) display images of older and younger individuals (OOD).
  • ...and 11 more figures
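
The test-time loop described in Figure 1 and the error measured in Figure 2 can be illustrated with a toy sketch. Everything here is hypothetical: `ToyModel` is a scalar linear stand-in for $f_{\theta}$, `NEUTRAL` is an assumed placeholder for "no label given", and the choice of which pass is frozen (first pass from the anchor, second pass trainable) is an assumption of this sketch, not taken from the paper.

```python
from dataclasses import dataclass, replace

NEUTRAL = 0.0  # assumed placeholder meaning "no label provided"


@dataclass
class ToyModel:
    """Hypothetical scalar stand-in for f_theta(x, z) = a*x + c*z + b."""
    a: float  # weight on the input x
    c: float  # weight on the label slot z
    b: float  # bias

    def __call__(self, x: float, z: float) -> float:
        return self.a * x + self.c * z + self.b


def anchored_error(model: ToyModel, anchor: ToyModel, x: float) -> float:
    """|y_1 - y_2| with y_1 from the frozen anchor and y_2 from the model."""
    y1 = anchor(x, NEUTRAL)  # first pass: no label given
    y2 = model(x, y1)        # second pass: feed the prediction back in
    return abs(y1 - y2)


def adapt_on_input(model: ToyModel, x: float,
                   steps: int = 50, lr: float = 0.05) -> ToyModel:
    """Briefly train a copy of the model on one test input x."""
    anchor = replace(model)  # frozen copy, never updated
    theta = replace(model)   # trainable copy; the original stays untouched
    y1 = anchor(x, NEUTRAL)  # constant target for this input
    for _ in range(steps):
        residual = theta(x, y1) - y1
        # Gradient of residual**2 w.r.t. (a, c, b) is 2*residual*(x, y1, 1).
        theta.a -= lr * 2.0 * residual * x
        theta.c -= lr * 2.0 * residual * y1
        theta.b -= lr * 2.0 * residual
    return theta


model = ToyModel(a=1.0, c=0.5, b=0.2)
x_test = 1.5  # stands in for a single (possibly shifted) test input
before = anchored_error(model, model, x_test)
adapted = adapt_on_input(model, x_test)
after = anchored_error(adapted, model, x_test)
print(f"idempotence error before: {before:.4f}, after: {after:.2e}")
```

The sketch adapts a copy per input and leaves the original weights unchanged, mirroring the idea of using only the current test input. A real model would use automatic differentiation instead of the hand-written gradient, and the step size would need tuning.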