What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?

Weijie Tu; Weijian Deng; Liang Zheng; Tom Gedeon

TL;DR

The paper tackles the problem of ranking classifiers when the test data are unlabeled and out-of-distribution (OOD). It introduces SoftmaxCorr, a measure that combines prediction certainty and diversity by computing a class-class correlation matrix $\mathbf{C}=\frac{\mathbf{P}^T\mathbf{P}}{N}$ and comparing it to a diagonal reference $\mathbf{R}$ via $\cos(\mathbf{C},\mathbf{R})$, with $\mathbf{R}$ estimated from a zero-shot vision-language model. Evaluated on ImageNet, CIFAR-10, and WILDS with 573 models and multiple OOD datasets, SoftmaxCorr achieves strong, stable correlations with ground-truth generalization $G_m$ and often outperforms baselines such as AoL, ATC-MC, MaxPred, and SoftGap. The work shows that probability-based OOD measures can rank models without labeled data and remain informative under domain adaptation settings. It also outlines limitations and directions for improving class-distribution estimation and robustness. Overall, SoftmaxCorr offers a practical, scalable tool for model selection under distribution shift, with potential impact on the deployment and monitoring of real-world systems.

Abstract

This work aims to develop a measure that can accurately rank the performance of various classifiers when they are tested on unlabeled data from out-of-distribution (OOD) distributions. We commence by demonstrating that conventional uncertainty metrics, notably the maximum Softmax prediction probability, possess inherent utility in forecasting model generalization across certain OOD contexts. Building on this insight, we introduce a new measure called Softmax Correlation (SoftmaxCorr). It calculates the cosine similarity between a class-class correlation matrix, constructed from Softmax output vectors across an unlabeled test dataset, and a predefined reference matrix that embodies ideal class correlations. A high resemblance of predictions to the reference matrix signals that the model delivers confident and uniform predictions across all categories, reflecting minimal uncertainty and confusion. Through rigorous evaluation across a suite of datasets, including ImageNet, CIFAR-10, and WILDS, we affirm the predictive validity of SoftmaxCorr in accurately forecasting model performance within both in-distribution (ID) and OOD settings. Furthermore, we discuss the limitations of our proposed measure and suggest avenues for future research.
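The computation described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `softmax_corr` and `max_pred` are hypothetical, the reference matrix is assumed to be a diagonal matrix holding an estimated class distribution, and a uniform distribution is assumed when no estimate is supplied.

```python
import numpy as np


def softmax_corr(probs, class_prior=None):
    """Cosine similarity between C = P^T P / N and a diagonal reference R.

    probs: (N, K) array of Softmax outputs on unlabeled test samples.
    class_prior: optional length-K estimate of the class distribution used
        for the diagonal of R (assumption: uniform when omitted).
    """
    probs = np.asarray(probs, dtype=float)
    n, k = probs.shape
    if class_prior is None:
        class_prior = np.full(k, 1.0 / k)
    corr = probs.T @ probs / n  # class-class correlation matrix C
    ref = np.diag(np.asarray(class_prior, dtype=float))  # reference matrix R
    # Cosine similarity between the two matrices, treated as flat vectors.
    num = np.sum(corr * ref)
    den = np.linalg.norm(corr) * np.linalg.norm(ref)
    return float(num / den)


def max_pred(probs):
    """Average maximum Softmax probability (the MaxPred baseline)."""
    return float(np.mean(np.max(np.asarray(probs, dtype=float), axis=1)))


# Confident, class-balanced predictions match the reference exactly.
confident = np.eye(4)
# Maximally uncertain predictions spread mass off the diagonal of C.
uncertain = np.full((8, 4), 0.25)
print(softmax_corr(confident), softmax_corr(uncertain))
```

In this toy setting the confident, balanced predictions score higher than the uniform ones, which matches the intuition that the measure rewards both certainty and diversity of predictions.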

Paper Structure (42 sections, 4 figures, 6 tables)

Introduction
Related Work
OOD generalization.
Unsupervised accuracy estimation (UAE)
Task Formulation
Evaluation metrics.
Softmax Probability-based OOD Measures
What Makes OOD Measures Interesting?
Beyond Accuracy-on-the-Line (AoL).
Why Use Softmax Prediction Probability?
Proof of concept.
Exploring More Empirical Measures
Average Thresholded Confidence with Maximum Confidence
Softmax Gap
Ours: Softmax Correlation (SoftmaxCorr)
...and 27 more sections

Figures (4)

Figure 1: Correlation study between MaxPred and accuracy (%) on ImageNet-S and ImageNet-R. Each point denotes a classifier. We use $173$ ImageNet models and $89$ vision-language models introduced in the experiments section. The straight line is fit with robust linear regression (Huber, 2011), and the shaded region shows the 95% Clopper-Pearson confidence intervals. MaxPred exhibits a moderate correlation with accuracy, while accuracy on ImageNet-validation shows a relatively low correlation with performance on ImageNet-R. Moreover, vision-language models (VLMs) follow different linear trends between ID and OOD accuracy than standard supervised models.
Figure 2: SoftmaxCorr vs. model generalization under the ImageNet, CIFAR-10, and WILDS setups. In each subfigure, each point denotes a model trained for the corresponding task. For the ImageNet setup, the OOD test sets are ObjectNet, ImageNet-S, and ImageNet-Blur. For the CIFAR-10 setup, the OOD test sets are CIFAR-10.2, CINIC, and CIFAR-10-Noise. For WILDS, the OOD test sets are Camelyon17-OOD and DomainNet-OOD. The $y$-axis is top-1 accuracy, top-1 accuracy, and macro-F1 for the three setups, respectively. Straight lines are fit with robust linear regression (Huber, 2011). Axes are probit scaled as described in the task formulation section. SoftmaxCorr is a reliable and effective measure. On ImageNet in particular, it is strongly predictive of model generalization ($\rho > 0.92$).
Figure 3: (a) Impact of the class distribution estimator. We use three estimators: ViT-H-14, ViT-bigG-14-CLIPA, and the ground truth, and find that SoftmaxCorr is fairly stable across them. (b) SoftmaxCorr vs. accuracy on the ImageNet-C benchmark. In every subfigure, each dot indicates a dataset of ImageNet-C. We see strong correlations between SoftmaxCorr and OOD accuracy on various test sets.
Figure 4: Correlation analysis between SoftmaxCorr and accuracy on CINIC. We consider four CIFAR-10 models: ResNet-20, DenseNet-121, VGG-11, and MobileNet. In each subfigure, every point represents a checkpoint of the model saved during training. Axes are probit scaled as in the task formulation section. For all four models, we see strong correlations ($\rho > 0.93$), which suggests that SoftmaxCorr is helpful for assessing checkpoints during training.
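The captions above report correlations $\rho$ on probit-scaled axes. The sketch below shows how such a ranking check could be computed with the Python standard library only. It assumes that $\rho$ is a rank correlation such as Spearman's, which the captions do not state explicitly; the helper names and the toy scores are hypothetical and are not results from the paper.

```python
from statistics import NormalDist


def probit(p, eps=1e-6):
    """Inverse standard-normal CDF, with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return NormalDist().inv_cdf(p)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model scores and accuracies, for illustration only.
scores = [0.31, 0.55, 0.48, 0.72]
accuracy = [0.40, 0.66, 0.61, 0.83]
rho_raw = spearman(scores, accuracy)
rho_probit = spearman([probit(s) for s in scores], [probit(a) for a in accuracy])
print(rho_raw, rho_probit)
```

Because the probit transform is strictly increasing, it leaves rank correlations unchanged. It affects only quantities that depend on the actual values, such as the fitted regression lines.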
