Probing Network Decisions: Capturing Uncertainties and Unveiling Vulnerabilities Without Label Information

Youngju Joung, Sehyun Lee, Jaesik Choi

TL;DR

This work targets interpretable deep learning by introducing a prober that encodes a classifier's decision into a binary hit/miss signal using intermediate representations, enabling misclassification detection without relying on true labels. It combines a Hit-Miss dataset, a lightweight three-layer FFN (the prober) trained with imbalance-mitigation techniques, and a RealNVP-based counterfactual generator that produces counterfactuals $ADC_{hit}(x)$, which semantically alter inputs to probe classifier weaknesses. Across MNIST, Fashion-MNIST, CIFAR-10, and ImageNette, the prober achieves strong misclassification detection performance, and analyses show that higher confidence (max probability) and lower entropy correlate with hits. Counterfactuals generated via the prober reveal actionable vulnerabilities and can significantly improve reclassification for true misses (~86.7% accuracy gain) without knowledge of the true labels, suggesting a path toward auto-correction and more objective explanations in opaque scenarios. Overall, the framework enhances trust and transparency by linking hidden representations to uncertainty, enabling targeted, label-agnostic explanations and vulnerability identification in image classification models.

Abstract

To improve trust and transparency, it is crucial to be able to interpret the decisions of deep neural classifiers (DNNs). Instance-level examinations, such as attribution techniques, are commonly employed to interpret model decisions. However, interpreting misclassified decisions may require human intervention. Analyzing the attributions across each class within one instance can be particularly labor-intensive and influenced by the bias of the human interpreter. In this paper, we present a novel framework to uncover the weaknesses of the classifier via counterfactual examples. A prober is introduced to learn the correctness of the classifier's decision as a binary code: hit or miss. It enables the creation of counterfactual examples with respect to the prober's decision. We test the performance of our prober's misclassification detection and verify its effectiveness on image classification benchmark datasets. Furthermore, by generating counterfactuals that penetrate the prober, we demonstrate on the MNIST dataset that our framework effectively identifies vulnerabilities in the target classifier without relying on label information.
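As a rough illustration of the hit/miss encoding described above, the sketch below builds prober training pairs from a classifier's hidden representations. The function names, the toy data, and the inverse-frequency weighting are assumptions made for illustration; the summary only states that imbalance-mitigation techniques are used, not which ones. True labels are presumably needed only to construct these training targets, not when the trained prober is applied.

```python
from collections import Counter


def build_hit_miss_dataset(hidden_reps, predictions, labels):
    """Pair each hidden representation with 1 (hit) or 0 (miss).

    A hit means the classifier's prediction equals the true label.
    """
    return [(h, int(p == y)) for h, p, y in zip(hidden_reps, predictions, labels)]


def inverse_frequency_weights(dataset):
    """Hypothetical imbalance mitigation: weight each class by inverse frequency."""
    counts = Counter(target for _, target in dataset)
    total = len(dataset)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


# Toy data (illustrative values only): four samples with 2-D hidden representations.
hidden_reps = [[0.2, 1.3], [0.9, 0.1], [1.1, 0.4], [0.3, 0.8]]
predictions = [9, 4, 7, 1]  # classifier outputs
labels = [4, 4, 7, 1]       # true labels, used only to build training targets

dataset = build_hit_miss_dataset(hidden_reps, predictions, labels)
weights = inverse_frequency_weights(dataset)
```

Because a well-trained classifier produces far more hits than misses, the rarer miss class receives the larger weight in this sketch; such weights could then scale the prober's binary classification loss.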

Paper Structure

This paper contains 16 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the proposed framework for investigating misclassified samples. In this example, the sample is classified as 9 while the true label is 4. Given the instance's hidden representation at a layer of the classifier, the prober predicts whether the classifier's decision is a hit or a miss. The counterfactual example is then generated through the classifier and the prober. In this process, obstructive features are modified, which reduces the uncertainty in the classifier's confidence.
  • Figure 2: Maximum probability (first row) and entropy of probability (second row) of the classifier.
  • Figure 3: The action of the prober according to the maximum probability of the classifier. The $xy$-plane is a 2D plane formed by three selected points from the dataset. At the bottom, for samples lying on the plane, the prober's prediction of the classifier's outcome (miss or hit) is shown in red or blue. The $z$-axis represents the probability of the prediction when the sample is fed into the classifier. The plot suggests that the prober tends to output hit for samples where the classifier is confident and miss where the classifier lacks confidence.
  • Figure 4: Counterfactual examples $ADC_{hit}(x)$ generated for the True Miss cases, where the classifier misclassifies the sample and the prober predicts a miss. $y$ and $\hat{y}$ denote the label and the prediction of the classifier, respectively. Despite lacking information about the true label, the prober identifies vulnerabilities in the classifier for each sample $x$. With this framework, we obtain examples close to the correct answer by modifying the regions indicated in red, corresponding to $\delta x$.
  • Figure 5: Counterfactual examples $ADC_{hit}(x)$ generated for the False Miss cases, where the classifier classifies the sample correctly but the prober predicts a miss. $y$ and $\hat{y}$ denote the label and the prediction of the classifier, respectively. Even though the original image $x$ is already correctly classified, $ADC_{hit}(x)$ is generated in a direction that reduces uncertainty from the perspective of the classifier.
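
The two uncertainty measures plotted in Figure 2, maximum probability and entropy of the classifier's output distribution, can be computed as in the minimal sketch below. The logit values are made up for illustration, and the use of the natural logarithm for entropy is an assumption, since the captions do not state the base.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def max_probability(probs):
    """Confidence of the classifier: the largest class probability."""
    return max(probs)


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Illustrative logits: one confident prediction and one uncertain prediction.
confident = softmax([8.0, 0.5, 0.1])
uncertain = softmax([1.0, 0.9, 1.1])

print(max_probability(confident), entropy(confident))
print(max_probability(uncertain), entropy(uncertain))
```

In this toy case the confident prediction has the higher maximum probability and the lower entropy, which is the pattern the summary associates with hits.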