Assessing Model Generalization in Vicinity

Yuchi Liu; Yifan Sun; Jingdong Wang; Liang Zheng

TL;DR

This paper proposes incorporating responses from neighboring test samples into the correctness assessment of each individual sample. It shows that applying the resulting vicinal risk proxy (VRP) to existing generalization indicators, such as average confidence and effective invariance, consistently improves over these baselines and yields a stronger correlation with model accuracy.

Abstract

This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels. Common approaches calculate an unsupervised metric related to a specific model property, such as confidence or invariance, which correlates with out-of-distribution accuracy. However, these metrics are typically computed for each test sample individually, which makes them vulnerable to spurious model responses, such as overly high or low confidence. To tackle this challenge, we propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample. In essence, if a model consistently demonstrates high correctness scores for nearby samples, the target sample is more likely to be predicted correctly, and vice versa. The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy. Developed under the vicinal risk formulation, this approach, named vicinal risk proxy (VRP), estimates accuracy without relying on labels. We show that applying the VRP method to existing generalization indicators, such as average confidence and effective invariance, consistently improves over these baselines both methodologically and experimentally. This yields a stronger correlation with model accuracy, especially on challenging out-of-distribution test sets.
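The idea in the abstract (replace each sample's score with a similarity-weighted combination of its neighbors' scores, then average over the test set) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of cosine similarity on feature vectors, the exclusion of the sample itself from its neighborhood, and the normalization of weights to sum to one are all assumptions made for illustration.

```python
import numpy as np


def vicinal_scores(features, scores, m=5):
    """Similarity-weighted average of the scores of each sample's m nearest neighbors.

    features: (n, d) array of per-sample feature vectors (assumed available).
    scores:   (n,) array of per-sample proxy scores, e.g. softmax confidence.
    m:        number of neighbors (hypothetical default).
    """
    features = np.asarray(features, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n = features.shape[0]
    if n < 2:
        raise ValueError("need at least two samples to form a neighborhood")
    m = min(m, n - 1)

    # Cosine similarity between all pairs of samples (an assumed similarity measure).
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)  # assumption: a sample is not its own neighbor

    # Indices of the m most similar samples for each row.
    nbr = np.argpartition(-sim, m - 1, axis=1)[:, :m]
    w = np.clip(np.take_along_axis(sim, nbr, axis=1), 0.0, None)

    # Normalize weights per sample; fall back to uniform weights if all are zero.
    w_sum = w.sum(axis=1, keepdims=True)
    w = np.where(w_sum > 0, w / np.where(w_sum > 0, w_sum, 1.0), 1.0 / m)
    return (w * scores[nbr]).sum(axis=1)


def vicinal_risk_proxy(features, scores, m=5):
    """Average the vicinal scores over the whole test set."""
    return float(vicinal_scores(features, scores, m).mean())
```

With two tight clusters and `m=1`, each sample simply inherits the score of its closest neighbor, so one spuriously high or low score is replaced by what the model does nearby.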

Paper Structure (26 sections, 17 equations, 11 figures, 6 tables)

Introduction
Related Work
Preliminaries
Risk and Accuracy in Supervised Evaluation
Vicinal Risk Minimization
Methodology
Risk Proxy in Unsupervised Evaluation
Proposed Vicinal Risk Proxy
Apply Vicinal Risk Proxy to Existing Proxies.
Discussions
Experiments
Datasets and Evaluation Metrics
Existing Risk Proxies as Baselines
Main Observations
Further Analysis of Vicinal Risk Proxy
...and 11 more sections

Figures (11)

Figure 1: Spurious model responses and how our method corrects them, using confidence as the model generalization indicator. For a given input $\bm{x}$ from ImageNet-R \cite{hendrycks2021many}, models $f_1$ and $f_2$ classify it incorrectly and correctly, respectively. However, the confidence score (0.991) of the incorrect prediction by $f_1$ is far higher than the score (0.245) of the correct prediction by $f_2$, which indicates spuriousness. Ranking the generalization ability of the two models by this test sample alone therefore fails. Our proposed vicinal method, a similarity-weighted sum of confidence, provides more reasonable scores (0.431 and 0.839).
Figure 2: Examples showing how vicinal risks of individual samples more effectively differentiate between models making correct and incorrect predictions. In (a) - (b), a single test sample from ImageNet-R is used along with 140 models trained on the ImageNet training set. We employ risk estimates based on confidence \cite{tu2023assessing, hendrycks17baseline}, confidence combined with our method, EI \cite{deng2022strong}, and EI combined with our method. The distributions of these risk estimates across the 140 models for the given test sample are illustrated. In (c) - (d), results are reported for another test sample. Notably, confidence and EI, which rely on the sample in isolation, lead to spurious model responses. In the top row, many models making incorrect predictions exhibit excessively high confidence/EI, while those making correct predictions may show unexpectedly low confidence/EI. In contrast, our method (bottom row) effectively corrects these erroneous risk estimates, enhancing the separation of risk estimates on individual samples between good and poor models. Consequently, the vicinal risk proxy averaged over the entire out-of-distribution (OOD) test set becomes a more reliable indicator of model accuracy. Additional examples are provided in the supplementary material (Fig. \ref{fig:model_histogram_AC}, \ref{fig:model_histogram_DoC} and \ref{fig:model_histogram_ATC}), and further statistical results are presented in Fig. \ref{fig:overlap_bar}.
Figure 3: The average overlap of risk estimates for individual samples between correct and incorrect model predictions. For each test sample, we first estimate the distributions of risk estimate scores over the models that predict it correctly and over those that predict it incorrectly (140 models in total). The overlap of the two distributions is then computed for each sample and finally averaged over the entire test set. All models are trained on ImageNet. In each figure, we use four test sets: ImageNet-A (A), ImageNet-R (R), ImageNet-S (S), and ObjectNet (O). From (a) to (e), EI, AC, CI, DoC, and ATC are used as baselines, respectively. A smaller value indicates lower overlap, i.e. higher separability. We clearly observe that vicinal risk scores (ours) statistically better differentiate models making correct and incorrect predictions by better separating their scores.
Figure 4: Correlation between Effective Invariance (EI) and accuracy. Each dot in the figures represents a model, and straight lines are fitted using robust linear regression \cite{Huber2011}. Blue dots mark models whose rectified score (VRP score) brings their rank closer to the actual accuracy rank. Conversely, red dots mark models whose rank deviates further from the real accuracy rank under the VRP paradigm. The rank of black models remains unchanged. The symbols $\rho$ and $\gamma$ have the same meaning as in Table \ref{tab:benchmark_testing}. The shaded region in each figure represents a 95% confidence interval for the linear fit, calculated from 1,000 bootstrap samples. The VRP paradigm effectively rectifies the proxy score for the majority of models on both the ImageNet-R and ObjectNet datasets. Additional results for alternative risk proxies are presented in the supplementary material (Sec. \ref{sec:append-data}).
Figure 5: Impact of the number of neighbors $m$ on the correlation between proxy scores and accuracy. We use five existing proxies as baselines and report the mean and standard deviation for each data point. We observe that vicinal assessment is consistently beneficial under various values of $m$ and yields a stronger correlation as $m$ increases.
...and 6 more figures
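The per-sample overlap statistic described in the Figure 3 caption can be sketched as follows. The caption does not state which overlap measure is used, so this sketch assumes a simple histogram intersection over a fixed score range. The function names, the bin count, and the `[0, 1]` score range are hypothetical choices for illustration only.

```python
import numpy as np


def histogram_overlap(scores_correct, scores_incorrect, bins=20, value_range=(0.0, 1.0)):
    """Histogram intersection between two sets of scores (1 = identical, 0 = disjoint).

    Assumes scores lie inside value_range; returns NaN if either set is empty.
    """
    a = np.asarray(scores_correct, dtype=float)
    b = np.asarray(scores_incorrect, dtype=float)
    if a.size == 0 or b.size == 0:
        return float("nan")
    ha, _ = np.histogram(a, bins=bins, range=value_range)
    hb, _ = np.histogram(b, bins=bins, range=value_range)
    if ha.sum() == 0 or hb.sum() == 0:
        return float("nan")
    # Normalize counts to probabilities and sum the bin-wise minimum.
    pa = ha / ha.sum()
    pb = hb / hb.sum()
    return float(np.minimum(pa, pb).sum())


def average_overlap(score_matrix, correct_matrix, bins=20, value_range=(0.0, 1.0)):
    """Average the per-sample overlap over a test set.

    score_matrix:   (num_models, num_samples) risk estimates.
    correct_matrix: (num_models, num_samples) booleans, True where a model is correct.
    Samples on which all models agree (all correct or all incorrect) are skipped.
    """
    score_matrix = np.asarray(score_matrix, dtype=float)
    correct_matrix = np.asarray(correct_matrix, dtype=bool)
    overlaps = []
    for j in range(score_matrix.shape[1]):
        mask = correct_matrix[:, j]
        value = histogram_overlap(score_matrix[mask, j], score_matrix[~mask, j],
                                  bins=bins, value_range=value_range)
        if not np.isnan(value):
            overlaps.append(value)
    return float(np.mean(overlaps)) if overlaps else float("nan")
```

Under this measure, a lower average means that the scores of models predicting a sample correctly are better separated from the scores of models predicting it incorrectly.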
