Labeling Neural Representations with Inverse Recognition

Kirill Bykov; Laura Kopf; Shinichi Nakajima; Marius Kloft; Marina M.-C. Höhne

TL;DR

INVERT introduces Inverse Recognition, a scalable method for labeling neural representations with compositional human concepts by maximizing an AUC-based similarity between a representation and a constructed concept. It builds compositional explanations from atomic concepts using AND/OR/NOT operators, optimized via beam-search at fixed length $L$ with a constraint on concept fraction $T(\varphi(\mathcal{C}))$ and beam size $B$. A statistical significance test based on the Wilcoxon–Mann–Whitney framework provides $p$-values for explanations, addressing randomness concerns common in IoU-based methods. The approach demonstrates utility across detecting spurious correlations, explaining circuits, and enabling handcrafted circuits, while offering a simplicity-precision tradeoff and favorable comparisons to IoU-based baselines. These capabilities advance XAI by delivering interpretable, statistically validated explanations without requiring segmentation masks, with practical impact on model auditing and symbolic analysis of neural representations.
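The AUC-based similarity and its significance test can be illustrated with a minimal sketch. It assumes a neuron is summarized by one scalar activation per data point and a concept by a binary label per data point; the function names `average_ranks` and `auc_similarity` are hypothetical and not taken from the paper. The AUC is computed from the Mann–Whitney $U$ statistic, and the two-sided $p$-value uses the standard normal approximation without a tie correction, which is a simplification that may differ from the authors' exact procedure.

```python
import math
import numpy as np


def average_ranks(values):
    """Return ranks starting at 1; tied values share their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank inside each tie group
    mean_rank = last_rank - (counts - 1) / 2.0  # average rank of each tie group
    return mean_rank[inverse]


def auc_similarity(activations, concept_mask):
    """AUC of a neuron's activations as a detector of a binary concept.

    Returns (auc, p_value). The p-value is two-sided and based on the normal
    approximation of the Mann-Whitney U statistic (no tie correction; an
    assumption made for this sketch).
    """
    activations = np.asarray(activations, dtype=float)
    concept_mask = np.asarray(concept_mask, dtype=bool)
    n_pos = int(concept_mask.sum())
    n_neg = concept_mask.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("the concept needs both positive and negative examples")

    ranks = average_ranks(activations)
    # U counts the (positive, negative) pairs in which the positive example
    # has the higher activation; ties count as one half.
    u = ranks[concept_mask].sum() - n_pos * (n_pos + 1) / 2.0
    auc = u / (n_pos * n_neg)

    # Under the null hypothesis, U has mean n_pos*n_neg/2 and this variance.
    sigma = math.sqrt(n_pos * n_neg * (n_pos + n_neg + 1) / 12.0)
    z = (u - n_pos * n_neg / 2.0) / sigma
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return float(auc), p_value


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    labels = rng.random(1000) < 0.1           # synthetic concept, about 10% positives
    acts = rng.normal(size=1000) + 1.5 * labels  # synthetic neuron that responds to it
    print(auc_similarity(acts, labels))
```

An AUC near 1 means the activations rank concept members above non-members, an AUC near 0.5 means the neuron carries no information about the concept, and the $p$-value indicates whether the observed separation could plausibly arise by chance.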

Abstract

Deep Neural Networks (DNNs) demonstrate remarkable capabilities in learning complex hierarchical data representations, but the nature of these representations remains largely unknown. Existing global explainability methods, such as Network Dissection, face limitations such as reliance on segmentation masks, lack of statistical significance testing, and high computational demands. We propose Inverse Recognition (INVERT), a scalable approach for connecting learned representations with human-understandable concepts by leveraging their capacity to discriminate between these concepts. In contrast to prior work, INVERT is capable of handling diverse types of neurons, exhibits less computational complexity, and does not rely on the availability of segmentation masks. Moreover, INVERT provides an interpretable metric assessing the alignment between the representation and its corresponding explanation and delivering a measure of statistical significance. We demonstrate the applicability of INVERT in various scenarios, including the identification of representations affected by spurious correlations, and the interpretation of the hierarchical structure of decision-making within the models.

Paper Structure (22 sections, 5 equations, 16 figures, 6 tables)

Introduction
Related work
INVERT: Interpreting Neural Representations with Inverse Recognition
Finding Optimal Compositional Explanations
Statistical significance
Analysis
Simplicity-Precision tradeoff
Evaluating the Accuracy of Explanations
Applications
Finding Spurious Correlations by Integrating New Concepts
Explaining Circuits
Handcrafting Circuits
Discussion and Conclusion
Appendix
Broader Impact
...and 7 more sections

Figures (16)

Figure 1: Demonstration of the INVERT method ($B = 1, \alpha = 0.35\%$) for neuron $f_{33}$ from the AvgPool layer of ResNet18, using the ImageNet 2012 validation dataset. The resulting explanations are shown in the bottom part of the figure, which illustrates three steps of the iterative process from $L=1$ to $L=3$. The INVERT explanations align with the neuron’s high-activating images, shown in the top-right panel.
Figure 2: Contrast between a poor explanation (left) and INVERT explanations with $L=1$ and varying parameter $\alpha$, for neuron 592 in the ViT-B/16 feature-extractor layer. The INVERT explanations were computed over the ImageNet 2012 validation set. As the parameter $\alpha$ increases, the concept fraction $T$ also increases, meaning that more data points belong to the positive class. The figure also shows the method’s ability to evaluate the statistical significance of the result: the poor explanation fails the significance test (two-sided alternative) with a p-value of 0.35, while all explanations provided by INVERT have $p < 0.005$.
Figure 3: Three different INVERT explanations, computed by adjusting the parameter $\alpha$ for Neuron 88 in the ResNet18 AvgPool layer. Higher values of this parameter lead to broader explanations at the cost of precision, and thus to a lower AUC. The visualization of the WordNet taxonomy for the hypernyms is provided in the Appendix.
Figure 4: Impact of the parameter $\alpha$ and formula length $L$ on the resulting explanations. The first row shows the average AUC of optimal explanations for 50 randomly sampled neurons from the feature-extractor part of each of the four ImageNet pre-trained models, with different values of $\alpha$ shown in different colors. These graphs indicate that neurons generally achieve the highest AUC for one individual class, with $L=1$ and $\alpha = 0$. The second row presents, for each model, the distribution of AUC scores alongside the distribution of concept fractions $T$ for INVERT explanations of length $L=5$. Here, a clear trade-off is visible between the precision of the explanation, measured by AUC, and the concept size $T$.
Figure 5: Comparing the computational cost of INVERT with Compositional Explanations of Neurons method (CompExp) in hours with varying formula lengths.
...and 11 more figures

Theorems & Definitions (4)

Definition 1: Neural representation
Definition 2: Concepts
Definition 3: AUC similarity
Definition 4: Compositional concept
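
The definitions are listed here by name only, so the following is a rough sketch of how a compositional concept might be searched for, not the authors' implementation. It assumes atomic concepts are boolean masks over the dataset, that NOT is applied only to atomic concepts, that each step extends a formula by one AND/OR clause, and that the constraint on the concept fraction $T$ is a lower bound $\alpha$ (an interpretation suggested by the Figure 2 caption). The names `auc` and `beam_search` are hypothetical.

```python
import numpy as np


def auc(activations, mask):
    """Rank-based AUC of activations against a boolean concept mask."""
    _, inverse, counts = np.unique(activations, return_inverse=True, return_counts=True)
    ranks = (np.cumsum(counts) - (counts - 1) / 2.0)[inverse]  # average ranks
    n_pos = int(mask.sum())
    n_neg = mask.size - n_pos
    u = ranks[mask].sum() - n_pos * (n_pos + 1) / 2.0
    return float(u / (n_pos * n_neg))


def beam_search(activations, concepts, L=3, B=2, alpha=0.0):
    """Search formulas of length L over atomic concepts, keeping B per step.

    concepts: dict mapping a concept name to a boolean array.
    Returns a list of (auc, formula_string, mask), best first.
    """
    activations = np.asarray(activations, dtype=float)

    # Literals are the atomic concepts and their negations.
    literals = {}
    for name, mask in concepts.items():
        mask = np.asarray(mask, dtype=bool)
        literals[name] = mask
        literals[f"NOT {name}"] = ~mask

    def valid(mask):
        # Both classes must be non-empty, and the concept fraction T must be
        # at least alpha (assumed form of the constraint).
        fraction = mask.mean()
        return 0.0 < fraction < 1.0 and fraction >= alpha

    def top(candidates):
        return sorted(candidates, key=lambda c: c[0], reverse=True)[:B]

    beam = top([(auc(activations, m), name, m)
                for name, m in literals.items() if valid(m)])

    for _ in range(L - 1):
        candidates = []
        for _, formula, mask in beam:
            for name, literal in literals.items():
                for op, combined in (("AND", mask & literal), ("OR", mask | literal)):
                    if valid(combined):
                        candidates.append((auc(activations, combined),
                                           f"({formula} {op} {name})", combined))
        if not candidates:
            break  # no admissible extension; keep the current beam
        beam = top(candidates)
    return beam


if __name__ == "__main__":
    acts = np.array([7.0, 5.0, 8.0, 6.0, 1.0, 2.0, 3.0, 4.0])
    toy_concepts = {
        "cat": np.array([1, 1, 0, 0, 0, 0, 0, 0], dtype=bool),
        "dog": np.array([0, 0, 1, 1, 0, 0, 0, 0], dtype=bool),
        "car": np.array([0, 0, 0, 0, 1, 0, 1, 0], dtype=bool),
    }
    for score, formula, _ in beam_search(acts, toy_concepts, L=2, B=2):
        print(f"{score:.3f}  {formula}")
```

In the toy data, no single concept separates the four highest activations from the rest, but a disjunction of two concepts does, which is the kind of case compositional explanations are meant to capture.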
