Leveraging Gradients for Unsupervised Accuracy Estimation under Distribution Shift

Renchunzi Xie, Ambroise Odonnat, Vasilii Feofanov, Ievgen Redko, Jianfeng Zhang, Bo An

TL;DR

This work addresses unsupervised estimation of model accuracy under distribution shift by introducing GdScore, a lightweight proxy based on the $L_p$ norm of the classification-layer gradient after a single gradient step on unlabeled test data. The method uses a simple pseudo-labeling strategy: high-confidence samples keep their predicted labels and low-confidence samples receive random labels, which enables backpropagation without access to true test labels. The authors provide theoretical connections between gradient norms and the true risk under shift and report state-of-the-art empirical performance across 11 benchmarks and multiple architectures, while running significantly faster than existing self-training baselines. The approach is robust across synthetic, natural, and subpopulation shifts, making it practical for real-world deployment with large models and confidential data. Overall, the paper highlights the informative role of gradient magnitudes in generalization under distribution shift and offers a scalable tool for unsupervised accuracy estimation.
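The pseudo-labeling strategy described in the TL;DR can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the `threshold` value of 0.5, and the fixed `seed` are assumptions made for the example, and a real pipeline would operate on batched tensors.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_labels(logit_rows, threshold=0.5, seed=0):
    """Return one pseudo-label per sample.

    Samples whose maximum softmax probability exceeds `threshold` keep the
    predicted class; all others get a class drawn uniformly at random.
    `threshold` and `seed` are illustrative assumptions.
    """
    rng = random.Random(seed)
    labels = []
    for logits in logit_rows:
        probs = softmax(logits)
        confidence = max(probs)
        if confidence > threshold:
            labels.append(probs.index(confidence))
        else:
            labels.append(rng.randrange(len(logits)))
    return labels


# First sample is confidently class 0; second is near-uniform, so it is
# assigned a random class.
example_logits = [[4.0, 0.0, 0.0], [0.1, 0.0, 0.2]]
example_labels = pseudo_labels(example_logits, threshold=0.5)
```

Under this sketch, confident predictions are trusted as labels, while uncertain ones are deliberately given noisy labels so that they still contribute to the backpropagated loss.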

Abstract

Estimating the test performance of a model, possibly under distribution shift, without having access to the ground-truth labels is a challenging, yet very important problem for the safe deployment of machine learning algorithms in the wild. Existing works mostly rely on information from either the outputs or the extracted features of neural networks to estimate a score that correlates with the ground-truth test accuracy. In this paper, we investigate -- both empirically and theoretically -- how the information provided by the gradients can be predictive of the ground-truth test accuracy even under distribution shifts. More specifically, we use the norm of classification-layer gradients, backpropagated from the cross-entropy loss after only one gradient step over test data. Our intuition is that these gradients should be of higher magnitude when the model generalizes poorly. We provide the theoretical insights behind our approach and the key ingredients that ensure its empirical success. Extensive experiments conducted with various architectures on diverse distribution shifts demonstrate that our method significantly outperforms current state-of-the-art approaches. The code is available at https://github.com/Renchunzi-Xie/GdScore
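To make the abstract's intuition concrete, here is a hedged sketch of the norm of a classification-layer gradient under the cross-entropy loss. It assumes a bias-free linear layer with logits $Wx$ (stored here as $K$ rows of length $D$), uses the closed-form gradient "softmax minus one-hot, times features", and takes labels as given (pseudo-labels when true labels are unavailable). The function name and the default `p=2.0` are illustrative assumptions, not the authors' code.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classifier_grad_norm(features, weights, labels, p=2.0):
    """L_p norm of d(mean cross-entropy)/d(weights) for logits = W x.

    features: n rows of length D; weights: K rows of length D;
    labels: n class indices (pseudo-labels when true labels are unknown).
    """
    n, num_classes, dim = len(features), len(weights), len(features[0])
    grad = [[0.0] * dim for _ in range(num_classes)]
    for x, y in zip(features, labels):
        logits = [sum(w_d * x_d for w_d, x_d in zip(w, x)) for w in weights]
        probs = softmax(logits)
        for k in range(num_classes):
            err = probs[k] - (1.0 if k == y else 0.0)  # softmax minus one-hot
            for d in range(dim):
                grad[k][d] += err * x[d] / n
    # Entrywise L_p norm of the flattened gradient matrix.
    return sum(abs(g) ** p for row in grad for g in row) ** (1.0 / p)


# Toy check: the classifier predicts class 0 for the first sample and
# class 1 for the second.
toy_features = [[1.0, 0.0], [0.0, 1.0]]
toy_weights = [[3.0, 0.0], [0.0, 3.0]]
norm_when_labels_agree = classifier_grad_norm(toy_features, toy_weights, [0, 1])
norm_when_labels_disagree = classifier_grad_norm(toy_features, toy_weights, [1, 0])
```

In this toy setup the gradient norm is small when the labels match the model's predictions and large when they contradict them, which mirrors the stated intuition that gradients should have higher magnitude when the model fits the test data poorly.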

Paper Structure (60 sections, 5 theorems, 35 equations, 5 figures, 10 tables, 1 algorithm)

Key Result

Theorem 4.1

Let $\mathbf{c}\in\mathbb{R}^{D\times K}$ and $\mathbf{c}'\in\mathbb{R}^{D\times K}$ be two linear classifiers. For any $p, q \geq 1$ such that $\frac{1}{p} + \frac{1}{q} = 1$, we have that

Figures (5)

  • Figure 1: Predicted versus true test accuracy on Entity-13 with ResNet18. We compare the performance of GdScore with that of Dispersion Score and ProjNorm via scatter plots. Each point represents one dataset under a given corruption type and severity; different shapes denote different corruption types, and darker colors denote higher severity levels.
  • Figure 2: Runtime comparison of two self-training approaches with ResNet50.
  • Figure 3: Performance comparison ($R^2$) on $7$ datasets with ResNet18 between (a) different label generation strategies and (b) different types of losses across $3$ types of distribution shifts. Results confirm that our proposed method performs better on average across various datasets and types of shifts.
  • Figure 4: Robustness comparison for all estimation baselines across diverse distribution shifts with ResNet18.
  • Figure 5: Sensitivity analysis on the effect of (a) norm types, (b) layer selection for gradients, and (c) epoch selection. The first and third experiments are conducted on CIFAR-10C, CIFAR-100C, and TinyImageNet-C, while the second experiment includes TinyImageNet-C, Office-31, Office-Home, and WILDS-FMoW (Koh et al., 2021). All experiments are conducted with ResNet18.
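Several of the captions above report $R^2$ between an estimation score and the true test accuracy. The page does not say how this is computed; the sketch below assumes the common choice of fitting a least-squares line from score to accuracy across shifted datasets and reporting its coefficient of determination. The function name and the sample values are hypothetical.

```python
def r_squared(scores, accuracies):
    """R^2 of the least-squares line accuracy ~ slope * score + intercept."""
    n = len(scores)
    mean_s = sum(scores) / n
    mean_a = sum(accuracies) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(scores, accuracies))
    var_s = sum((s - mean_s) ** 2 for s in scores)
    slope = cov / var_s
    intercept = mean_a - slope * mean_s
    # Residual and total sums of squares of the fitted line.
    ss_res = sum(
        (a - (slope * s + intercept)) ** 2 for s, a in zip(scores, accuracies)
    )
    ss_tot = sum((a - mean_a) ** 2 for a in accuracies)
    return 1.0 - ss_res / ss_tot


# Hypothetical values: one (score, accuracy) pair per shifted dataset.
demo_scores = [0.10, 0.20, 0.30, 0.40]
demo_accuracies = [0.90, 0.82, 0.69, 0.61]
demo_r2 = r_squared(demo_scores, demo_accuracies)
```

A value close to 1 means the score is almost linearly predictive of accuracy across shifts, whatever the sign of the relationship.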

Theorems & Definitions (11)

  • Theorem 4.1: Connection between the true risk and the $L_p$-norm of the gradient
  • Corollary 4.2: Connection after one gradient update
  • Theorem 4.3: Upper-bounding the norm of the gradient
  • Remark 6.1: Case $0<p<1$
  • Lemma H.1
  • proof
  • proof
  • Lemma H.2
  • proof
  • proof
  • ...and 1 more