Monitoring Risks in Test-Time Adaptation

Mona Schirmer, Metod Jazbec, Christian A. Naesseth, Eric Nalisnick

TL;DR

The paper addresses performance degradation under distribution shift in deployed models that perform test-time adaptation (TTA). It introduces a risk-monitoring framework based on sequential testing with time-uniform confidence sequences that detects, even without test labels, when the running test risk $\bar{R}_t(p_{1:t})$ exceeds the source risk $R_0(p_0)$ by at least $\epsilon_{\text{tol}}$. A key contribution is an unsupervised lower bound $L_t^b$ on the running risk, derived from a loss proxy $u_k=g({\bm{x}}_k,p_k)$ and online threshold calibration, which yields an alarm $\Phi_t^b$ with provable false-alarm control. The method is validated across diverse datasets and TTA methods, where it reliably detects risk violations and identifies TTA collapse while remaining robust to various shift types. It thus enables safer deployment of adaptive models in dynamic environments where labeled feedback is scarce.
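To make the alarm rule concrete, here is a minimal Python sketch of the supervised case only: it assumes the test losses are observed, which the paper's unsupervised bound $L_t^b$ is designed to avoid. It is not the paper's confidence sequence. As a simple stand-in, it uses a Hoeffding-type bound made time-uniform by a union bound over steps, which is typically looser. The function names, the error budget split, and the simulated 0-1 losses are all illustrative assumptions.

```python
import math
import random


def hoeffding_radius(n, delta, M=1.0):
    """One-sided Hoeffding half-width for the mean of n losses in [0, M]."""
    return M * math.sqrt(math.log(1.0 / delta) / (2.0 * n))


def source_upper_bound(source_losses, delta, M=1.0):
    """U_0: upper bound on the source risk R_0 from labeled source losses."""
    n = len(source_losses)
    return sum(source_losses) / n + hoeffding_radius(n, delta, M)


def first_alarm_time(test_losses, U0, eps_tol, delta, M=1.0):
    """First t with L_t > U_0 + eps_tol, or None if no alarm is raised.

    Assumption: test losses are observed (supervised setting).
    """
    total = 0.0
    for t, z in enumerate(test_losses, start=1):
        total += z
        # Union bound over time: the per-step budgets 6*delta/(pi^2 t^2)
        # sum to delta, so the lower bound holds for all t simultaneously.
        delta_t = 6.0 * delta / (math.pi ** 2 * t ** 2)
        L_t = total / t - hoeffding_radius(t, delta_t, M)
        if L_t > U0 + eps_tol:
            return t
    return None


# Illustrative simulation with 0-1 losses (synthetic data, not paper results).
rng = random.Random(0)
delta, eps_tol = 0.05, 0.05
source = [float(rng.random() < 0.1) for _ in range(2000)]
# Give half of the error budget to the source bound, half to the test bound.
U0 = source_upper_bound(source, delta / 2)
benign = [float(rng.random() < 0.1) for _ in range(2000)]
shifted = [float(rng.random() < 0.6) for _ in range(2000)]
print("benign stream alarm time:", first_alarm_time(benign, U0, eps_tol, delta / 2))
print("shifted stream alarm time:", first_alarm_time(shifted, U0, eps_tol, delta / 2))
```

If both bounds hold, an alarm implies $\bar{R}_t \geq L_t > U_0 + \epsilon_{\text{tol}} \geq R_0 + \epsilon_{\text{tol}}$, so in this sketch the probability of a false alarm at any time is at most `delta`.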

Abstract

Encountering shifted data at test time is a ubiquitous challenge when deploying predictive models. Test-time adaptation (TTA) methods address this issue by continuously adapting a deployed model using only unlabeled test data. While TTA can extend the model's lifespan, it is only a temporary solution. Eventually the model might degrade to the point that it must be taken offline and retrained. To detect such points of ultimate failure, we propose pairing TTA with risk monitoring frameworks that track predictive performance and raise alerts when predefined performance criteria are violated. Specifically, we extend existing monitoring tools based on sequential testing with confidence sequences to accommodate scenarios in which the model is updated at test time and no test labels are available to estimate the performance metrics of interest. Our extensions unlock the application of rigorous statistical risk monitoring to TTA, and we demonstrate the effectiveness of our proposed TTA monitoring framework across a representative set of datasets, distribution shift types, and TTA methods.

Paper Structure

This paper contains 49 sections, 5 theorems, 28 equations, 10 figures, 2 algorithms.

Key Result

Proposition 1

Assume a non-negative, bounded loss $\ell \in [0, M], M > 0$. Further, assume that for a sequence of losses ${\mathbf{z}}_{0:t}$, a sequence of loss proxies ${\mathbf{u}}_{0:t}$ together with thresholds $\lambda_{0}, \ldots, \lambda_t \in \mathbb{R}, \tau \in (0, M)$ satisfying Assumption eq:assumpt

Figures (10)

  • Figure 1: Alarm $\Phi_t$ is raised at $t_{\text{min}}$ as the lower bound $L_t$ on the running test risk $\bar{R}_t$ exceeds the upper bound $U_0$ on the source risk $R_0$.
  • Figure 2: Test risk of increasing severity on ImageNet-C (GN): Our unsupervised lower bound $\hat{L}_t^{b}$ on the empirical test risk $\hat{\bar{R}}_t$ closely follows the supervised lower bound $\hat{L}_t^{a}$.
  • Figure 3: Estimated test risk for different datasets and TTA methods: Our lower bound $\hat{L}_t^{b}$ consistently exceeds the risk threshold $\hat{U}_0 + \epsilon_{\text{tol}}$ when a true risk violation occurs (ImageNet severity 5, Yearbook), while remaining below it on benign shifts (ImageNet ID, FMoW-Time), across all TTA methods.
  • Figure 4: Collapsed vs. non-collapsed model on ImageNet-C (GN): When collapsed (right), the model always predicts the same class, which our monitor flags.
  • Figure 5: Comparison of loss proxies for last-layer TTA methods on ImageNet-C (GN) severity 5. Distance to class prototype is more effective than uncertainty for this TTA class.
  • ...and 5 more figures

Theorems & Definitions (9)

  • Proposition 1
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Corollary 1
  • proof
  • Proposition 3
  • proof