Minimal Sufficient Views: A DNN model making predictions with more evidence has higher accuracy
Keisuke Kawano, Takuro Kutsuna, Keisuke Sano
TL;DR
This work introduces minimal sufficient views (MSVs), a principled framework for identifying multiple disjoint regions of an input, each of which preserves a DNN's predicted class. An MSV is formalized by a sufficiency condition, $c_f(\bm{x}) = c_f(m(\bm{x}, \mathbb{V}))$, where $c_f$ denotes the class predicted by model $f$ and $m(\bm{x}, \mathbb{V})$ is the input with everything outside the regions $\mathbb{V}$ masked, together with a minimality condition. To make MSVs computationally practical, the authors develop GreedyMSVs, which uses a relaxed $\beta$-split-minimality to enable fast, gradient-free estimation suitable for black-box models. Empirically, they show a robust positive correlation between the average number of MSVs and generalization performance across CNNs and Vision Transformers, and they demonstrate that the MSV count is a reliable, label-free metric for model selection that is less affected by overfitting than traditional metrics such as average confidence. The approach also supports explainable AI (XAI) by explaining a prediction from multiple perspectives, extends to detection models, and outperforms related methods such as BG-SIS both in speed and in how well the identified evidence aligns with the object. Overall, MSVs offer a scalable, evidence-grounded lens on generalization and a practical tool for evaluating models without labeled data.
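The sufficiency condition and the idea of extracting several disjoint sufficient regions can be illustrated with a minimal sketch. It assumes an input split into patches (a 1-D NumPy array with one value per patch), a masking function $m$ that replaces every patch outside the view with a zero baseline, and a toy linear classifier standing in for a DNN. The names `predict`, `mask`, `is_sufficient`, `shrink`, and `greedy_views` are hypothetical. The search is a single greedy deletion pass per view, not the authors' GreedyMSVs and not their $\beta$-split-minimality criterion, so it only approximates minimality.

```python
import numpy as np

# Assumptions (not from the paper): an input is a 1-D array with one value per patch,
# masking replaces patches outside the view with a zero baseline, and f maps an
# input to a vector of class logits. This is an illustrative stand-in, not GreedyMSVs.

def predict(f, x):
    """c_f(x): the index of the largest logit."""
    return int(np.argmax(f(x)))

def mask(x, view, baseline=0.0):
    """m(x, V): keep the patches listed in `view`, overwrite the rest with `baseline`."""
    idx = np.asarray(sorted(view), dtype=int)
    out = np.full(x.shape, baseline, dtype=float)
    out[idx] = x[idx]
    return out

def is_sufficient(f, x, view, target):
    """Sufficiency: the masked input is still assigned the original class."""
    return predict(f, mask(x, view)) == target

def shrink(f, x, view, target):
    """One greedy deletion pass: drop a patch whenever the remainder stays sufficient."""
    view = sorted(view)
    for p in list(view):
        candidate = [q for q in view if q != p]
        if is_sufficient(f, x, candidate, target):
            view = candidate
    return view

def greedy_views(f, x):
    """Extract disjoint sufficient views until the unused patches no longer suffice."""
    target = predict(f, x)
    if is_sufficient(f, x, [], target):
        return []  # degenerate: the fully masked input already yields the class
    available = set(range(x.shape[0]))
    views = []
    while is_sufficient(f, x, available, target):
        view = shrink(f, x, available, target)  # non-empty, so `available` shrinks
        views.append(view)
        available -= set(view)
    return views

# Toy linear "model": class 1 needs at least three of the six patches to win.
W = np.array([[0.0] * 6, [1.0] * 6])
b = np.array([2.5, 0.0])
f = lambda z: W @ z + b
x = np.ones(6)
views = greedy_views(f, x)
print(views)       # [[3, 4, 5], [0, 1, 2]]
print(len(views))  # 2
```

Each view is shrunk only from patches not used by earlier views, so the views are disjoint by construction, and `len(views)` plays the role of the per-image count that this work relates to generalization performance.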
Abstract
Deep neural networks (DNNs) exhibit high performance in image recognition; however, the reasons for their strong generalization abilities remain unclear. A plausible hypothesis is that DNNs achieve robust and accurate predictions by identifying multiple pieces of evidence in images. To test this hypothesis, this study proposes minimal sufficient views (MSVs). MSVs are defined as a set of minimal regions within an input image, each of which is sufficient to preserve the DNN's prediction; they thus represent the evidence discovered by the DNN. We empirically demonstrate a strong correlation between the number of MSVs (i.e., the number of pieces of evidence) and the generalization performance of DNN models. Remarkably, this correlation holds both within a single DNN and across different DNNs, including convolutional and transformer models. This suggests that a DNN model that bases its predictions on more evidence generalizes better. We also propose an MSV-based metric for DNN model selection that does not require label information, and we empirically show that this metric is less dependent on the degree of overfitting, making it a more reliable indicator of model performance than existing metrics such as average confidence.
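The label-free selection rule can be sketched as follows: score each candidate model by its average number of views over unlabeled images and keep the highest scorer. This is a minimal sketch under stated assumptions: inputs are 1-D arrays of patch values, masking uses a zero baseline, the two "models" are hypothetical linear classifiers rather than DNNs, and `count_views` and `select_model` are hypothetical names for a compact greedy stand-in, not the authors' GreedyMSVs or their exact metric.

```python
import numpy as np

def count_views(f, x, baseline=0.0):
    """Hypothetical stand-in for an MSV counter (not the authors' GreedyMSVs):
    the number of disjoint, greedily shrunk patch sets that each keep f's predicted class."""
    n = x.shape[0]

    def pred(keep):
        # Predicted class of the input with every patch outside `keep` set to `baseline`.
        z = np.where(np.isin(np.arange(n), sorted(keep)), x, baseline)
        return int(np.argmax(f(z)))

    target = pred(range(n))
    if pred([]) == target:
        return 0  # degenerate: the fully masked input already yields the class
    available, count = set(range(n)), 0
    while pred(available) == target:
        view = sorted(available)
        for p in list(view):
            # Greedy deletion: drop p if the remaining patches still keep the class.
            if pred([q for q in view if q != p]) == target:
                view.remove(p)
        available -= set(view)
        count += 1
    return count

def select_model(models, images):
    """Label-free selection: the highest average view count over unlabeled images wins."""
    scores = {name: float(np.mean([count_views(f, x) for x in images]))
              for name, f in models.items()}
    return max(scores, key=scores.get), scores

# Two toy linear models with the same bias; A spreads its evidence over all six patches.
bias = np.array([2.5, 0.0])
W_a = np.array([[0.0] * 6, [1.0] * 6])
W_b = np.array([[0.0] * 6, [1.0, 1.0, 1.0, 0.0, 0.0, 0.0]])
models = {"A": lambda z: W_a @ z + bias, "B": lambda z: W_b @ z + bias}
images = [np.ones(6)]  # unlabeled inputs
best, scores = select_model(models, images)
print(best, scores)  # A {'A': 2.0, 'B': 1.0}
```

No labels appear anywhere: each score depends only on whether the model's own prediction survives masking. The toy illustrates the mechanics only; the link between view count and accuracy is the empirical finding reported in this work, not something the sketch demonstrates.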
