Minimal Sufficient Views: A DNN model making predictions with more evidence has higher accuracy

Keisuke Kawano, Takuro Kutsuna, Keisuke Sano

TL;DR

This work introduces minimal sufficient views (MSVs), a framework for identifying multiple disjoint regions of an input that each preserve a DNN’s predicted class. A view $\mathbb{V}$ qualifies when it satisfies the sufficiency condition $c_f(\bm{x}) = c_f(m(\bm{x}, \mathbb{V}))$, i.e., the class predicted for the masked input equals the class predicted for the full input, and when it is also minimal. To make MSVs computationally practical, the authors develop GreedyMSVs, which relaxes minimality to $\beta$-split-minimality and yields a fast, gradient-free estimator suitable for black-box models. Empirically, they show a positive correlation between the average number of MSVs and generalization performance across CNNs and Vision Transformers. They also demonstrate that the MSV count can serve as a label-free metric for model selection that is less affected by overfitting than traditional metrics. The approach further supports an XAI use case, explaining a prediction from multiple perspectives, and extends to detection models. Compared with related methods such as BG-SIS, it is faster and aligns better with object-centered evidence. Overall, MSVs offer a scalable, evidence-grounded lens on generalization and a practical tool for model evaluation without labeled data.
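The sufficiency check and the idea of collecting several disjoint views can be illustrated with a small sketch. This is not the authors' GreedyMSVs: the summary does not give its splitting rule, so the code below substitutes a simpler single pass that drops one feature at a time. The names `mask`, `is_sufficient`, `shrink_view`, `greedy_disjoint_views`, the zero baseline, and the toy classifier are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Set

Predictor = Callable[[List[float]], int]


def mask(x: Sequence[float], view: Set[int], baseline: float = 0.0) -> List[float]:
    """Keep features whose index is in `view`; replace the rest with `baseline`.

    Assumption: masking with a constant baseline stands in for m(x, V).
    """
    return [v if i in view else baseline for i, v in enumerate(x)]


def is_sufficient(predict: Predictor, x: Sequence[float], view: Set[int]) -> bool:
    """Sufficiency: the masked input gets the same class as the full input."""
    return predict(mask(x, view)) == predict(list(x))


def shrink_view(predict: Predictor, x: Sequence[float], view: Set[int]) -> Set[int]:
    """One pass of single-feature removals; keeps a removal only if the
    prediction is preserved. Never shrinks the view to the empty set."""
    view = set(view)
    for i in sorted(view):
        candidate = view - {i}
        if candidate and is_sufficient(predict, x, candidate):
            view = candidate
    return view


def greedy_disjoint_views(predict: Predictor, x: Sequence[float]) -> List[Set[int]]:
    """Repeatedly extract a small sufficient view from the unused features."""
    remaining = set(range(len(x)))
    views: List[Set[int]] = []
    while remaining and is_sufficient(predict, x, remaining):
        view = shrink_view(predict, x, remaining)
        views.append(view)
        remaining -= view  # later views must be disjoint from earlier ones
    return views


def toy_predict(x: List[float]) -> int:
    """Hypothetical black-box classifier: class 1 if any feature exceeds 0.5."""
    return 1 if any(v > 0.5 for v in x) else 0


x = [0.9, 0.1, 0.8, 0.0, 0.7]
views = greedy_disjoint_views(toy_predict, x)
print(views)
```

In this toy setting each strong feature (indices 0, 2, and 4) is on its own enough to keep the prediction, so the search finds three disjoint views. The procedure only calls `predict`, which is why this style of search needs no gradients.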

Abstract

Deep neural networks (DNNs) exhibit high performance in image recognition; however, the reasons for their strong generalization abilities remain unclear. A plausible hypothesis is that DNNs achieve robust and accurate predictions by identifying multiple pieces of evidence in images. To test this hypothesis, this study proposes minimal sufficient views (MSVs). MSVs are defined as a set of minimal regions within an input image, each sufficient to preserve the prediction of the DNN, and thus represent the evidence discovered by the DNN. We empirically demonstrate a strong correlation between the number of MSVs (i.e., the number of pieces of evidence) and the generalization performance of DNN models. Remarkably, this correlation holds within a single DNN as well as across different DNNs, including convolutional and transformer models. This suggests that a DNN model that bases its predictions on more evidence has higher generalization performance. We propose a metric based on MSVs for DNN model selection that does not require label information. We empirically show that the proposed metric is less dependent on the degree of overfitting, making it a more reliable indicator of model performance than existing metrics such as average confidence.
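As a rough illustration of the label-free selection metric, the sketch below averages a per-image view count over unlabeled images and picks the model with the highest average. The function names, the two placeholder models, and the toy counting rules are hypothetical; in practice the per-image count would come from an MSV estimator, and the numbers here are illustrative only, not results from the paper.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical signature: maps one unlabeled image to its estimated MSV count.
ViewCounter = Callable[[Sequence[float]], int]


def average_view_count(count_views: ViewCounter,
                       images: Sequence[Sequence[float]]) -> float:
    """Average number of estimated views per image; no labels are used."""
    return mean(count_views(img) for img in images)


def select_model(counters: Dict[str, ViewCounter],
                 images: Sequence[Sequence[float]]) -> Tuple[str, Dict[str, float]]:
    """Score every candidate model and return the one with the highest average."""
    scores = {name: average_view_count(fn, images) for name, fn in counters.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy stand-ins for two models (assumptions for illustration only):
# model_a treats every strong feature as separate evidence,
# model_b relies on at most one piece of evidence per image.
counters: Dict[str, ViewCounter] = {
    "model_a": lambda img: sum(v > 0.5 for v in img),
    "model_b": lambda img: min(1, sum(v > 0.5 for v in img)),
}
images: List[List[float]] = [[0.9, 0.1, 0.8], [0.7, 0.6, 0.2]]

best, scores = select_model(counters, images)
print(best, scores)
```

The design choice worth noting is that `select_model` never sees a label: the ranking depends only on how many pieces of evidence each model is estimated to use on the same unlabeled images.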

Paper Structure (37 sections, 3 equations, 18 figures, 5 tables, 1 algorithm)

Figures (18)

  • Figure 1: Average number of estimated MSVs for 1000 randomly sampled images from the ImageNet validation set (x-axis) and prediction accuracy on all data in the ImageNet validation set (y-axis). No label information is used to compute the MSVs. ImageNet-trained DNN models obtained from TorchVision are used in the evaluation (listed in the legend). See Section \ref{sec:experiments} for the experimental details.
  • Figure 2: Example of estimated MSVs for an image in the Open Images validation set. Section \ref{sec:msv} presents the definitions of the mathematical expressions.
  • Figure 3: Execution example of MSVs search by GreedyMSVs.
  • Figure 4: MSVs for images from the Open Images validation set that are predicted to be Class 285.
  • Figure 5: Images from the Open Images validation set predicted to be Class 963. The number of estimated MSVs is 1 for the left group and greater than 5 for the right group. The estimated MSVs for each image are detailed in Appendix \ref{apdx:msv_examples}.
  • ...and 13 more figures

Theorems & Definitions (3)

  • Definition 1
  • Definition 2
  • Definition 3