What Does Softmax Probability Tell Us about Classifiers Ranking Across Diverse Test Conditions?
Weijie Tu, Weijian Deng, Liang Zheng, Tom Gedeon
TL;DR
The paper tackles the problem of ranking classifiers when the test data are unlabeled and out-of-distribution (OOD). It introduces SoftmaxCorr, a measure that combines prediction certainty and diversity: it computes a class-class correlation matrix $\mathbf{C}=\frac{\mathbf{P}^T\mathbf{P}}{N}$ from the Softmax outputs $\mathbf{P}$ of the $N$ test samples and compares it to a diagonal reference $\mathbf{R}$ via $\cos(\mathbf{C},\mathbf{R})$, with $\mathbf{R}$ estimated from a zero-shot vision-language model. Evaluated across ImageNet, CIFAR-10, and WILDS with 573 models and multiple OOD datasets, SoftmaxCorr consistently achieves strong, stable correlations with ground-truth generalization $G_m$ and often outperforms baselines such as AoL, ATC-MC, MaxPred, and SoftGap. The work demonstrates that probability-based measures can rank models on OOD data without labels and remain informative under domain adaptation settings. It also outlines limitations and directions for improving class-distribution estimation and robustness. Overall, SoftmaxCorr offers a practical, scalable tool for model selection under distribution shift, with potential impact on the deployment and monitoring of real-world systems.
Abstract
This work aims to develop a measure that can accurately rank the performance of various classifiers when they are tested on unlabeled out-of-distribution (OOD) data. We commence by demonstrating that conventional uncertainty metrics, notably the maximum Softmax prediction probability, possess inherent utility in forecasting model generalization across certain OOD contexts. Building on this insight, we introduce a new measure called Softmax Correlation (SoftmaxCorr). It calculates the cosine similarity between a class-class correlation matrix, constructed from Softmax output vectors across an unlabeled test dataset, and a predefined reference matrix that embodies ideal class correlations. A high similarity between the correlation matrix and the reference matrix signals that the model delivers confident and uniform predictions across all categories, reflecting minimal uncertainty and confusion. Through rigorous evaluation across a suite of datasets, including ImageNet, CIFAR-10, and WILDS, we affirm the predictive validity of SoftmaxCorr in accurately forecasting model performance within both in-distribution (ID) and OOD settings. Furthermore, we discuss the limitations of our proposed measure and suggest avenues for future research.
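The computation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `softmax_corr` and `max_pred` are hypothetical, the reference matrix is assumed here to be a diagonal matrix built from a uniform class prior (an estimated class distribution can be passed in instead), and the maximum-Softmax baseline is assumed to be the mean of the per-sample maximum probability.

```python
import numpy as np


def softmax_corr(probs, class_prior=None):
    """Cosine similarity between C = P^T P / N and a diagonal reference R.

    probs: array of shape (N, K), each row a Softmax output vector.
    class_prior: optional length-K vector placed on the diagonal of R.
        Assumption for this sketch: uniform (1/K) when not given.
    """
    probs = np.asarray(probs, dtype=float)
    n, k = probs.shape
    if class_prior is None:
        class_prior = np.full(k, 1.0 / k)
    corr = probs.T @ probs / n            # class-class correlation matrix C
    ref = np.diag(class_prior)            # diagonal reference matrix R
    # Cosine similarity of the two matrices viewed as flattened vectors.
    num = np.sum(corr * ref)
    den = np.linalg.norm(corr) * np.linalg.norm(ref)
    return float(num / den)


def max_pred(probs):
    """Assumed baseline: mean of the maximum Softmax probability per sample."""
    return float(np.asarray(probs, dtype=float).max(axis=1).mean())


# Toy inputs (illustrative only, not data from the paper).
confident_balanced = np.eye(3)              # one confident prediction per class
fully_uncertain = np.full((6, 3), 1.0 / 3)  # uniform Softmax outputs
collapsed = np.tile([1.0, 0.0, 0.0], (6, 1))  # confident, but always class 0

score_confident = softmax_corr(confident_balanced)
score_uncertain = softmax_corr(fully_uncertain)
score_collapsed = softmax_corr(collapsed)
```

On these toy inputs, the confident and class-balanced predictions reach the maximum score of 1, while both the fully uncertain and the collapsed predictions score lower. The collapsed case is worth noting: its mean maximum probability is 1, so a certainty-only measure such as `max_pred` would not penalize it, whereas the comparison with the reference matrix does. To rank several models under this sketch, one would compute `softmax_corr` for each model on the same unlabeled test set and sort by the score.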
