Decompose the model: Mechanistic interpretability in image models with Generalized Integrated Gradients (GIG)

Yearim Kim; Sangyu Han; Sangbum Han; Nojun Kwak

TL;DR

We present a method that systematically traces the full pathway from the input, through every intermediate layer, to the final output across the whole dataset. It clarifies what the internal features of image models mean and gives a holistic view of how these models operate.

Abstract

In eXplainable AI (XAI) for language models, the progression from local explanations of individual decisions to global explanations built on high-level concepts has laid the groundwork for mechanistic interpretability, which aims to decode a model's exact operations. However, this paradigm has not been adequately explored in image models, where existing methods have primarily focused on class-specific interpretations. This paper introduces a novel approach that systematically traces the entire pathway from the input, through all intermediate layers, to the final output across the whole dataset. We use Pointwise Feature Vectors (PFVs) and Effective Receptive Fields (ERFs) to decompose model embeddings into interpretable Concept Vectors. We then calculate the relevance between concept vectors with our Generalized Integrated Gradients (GIG), enabling a comprehensive, dataset-wide analysis of model behavior. We validate both concept extraction and concept attribution with qualitative and quantitative evaluations. Our approach advances the understanding of semantic significance within image models, offering a holistic view of their operational mechanics.
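
As background for the attribution step, the following is a minimal sketch of standard Integrated Gradients, the method that GIG generalizes. It is not the paper's GIG, whose definition is not given in this section. The toy model, its gradient, and all function names are assumptions introduced only for illustration.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Vector],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 200,
) -> Vector:
    """Approximate standard Integrated Gradients with a midpoint Riemann sum.

    grad_fn returns the gradient of the model output with respect to its input.
    The path is the straight line from `baseline` to `x`.
    """
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            total[i] += g
    # Average gradient along the path, scaled by the input difference.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy model (an assumption, not from the paper):
# f(v) = v0^2 + 3 * v0 * v1, with its analytic gradient.
def toy_model(v: Sequence[float]) -> float:
    return v[0] ** 2 + 3.0 * v[0] * v[1]


def toy_grad(v: Sequence[float]) -> Vector:
    return [2.0 * v[0] + 3.0 * v[1], 3.0 * v[0]]


x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)
```

For this toy model the attributions sum to `toy_model(x) - toy_model(baseline)` up to floating-point error, which is the completeness property that makes path-integrated attributions convenient for comparing contributions.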

Paper Structure (14 sections, 7 equations, 5 figures)

Introduction
Related Works
Method
  Analysis Unit: PFV-ERF dataset
  Concept Extraction
    Concept vector generation
    PFV decomposition
  Inter-layer concept attribution
Experiment
  Qualitative Analysis
  Validation of our method
    Validation of Concept Extraction
    Validation of Inter-layer Concept Attribution
Conclusion

Figures (5)

Figure 1: Top: Causal explanation graph from high to low layers. From top to bottom, [Classifier, Layer4.2, Layer3.5, Layer3.2, Layer3.0, Layer2.3, Layer1.2], the bottleneck blocks in ResNet50. The thicker and bluer the edge, the stronger the contribution between concepts. Unlike class-wise global explanation, our method can explain the 'Shared concepts' between similar classes. Among the thousands of concepts in a layer, the graph shows only the top-5 most important concepts and the top-3 shared concepts. Bottom left: Detailed concept visualization of Concept 3,320 "Bird chest" at the Layer3.5 block. The top-10 nearest embeddings show that Concept 3,320 is "Bird chest." Its concept localization image shows where Concept 3,320 resides in input images of the classes house finch and junco, respectively. Bottom right: Concept 3,209 "Round head" at the Layer3.2 block. The top-1 representation image of Concept 3,209 (a girl's head) seems irrelevant to the class junco. However, the concept localization image and the corresponding top-10 nearest embeddings show that Concept 3,209 represents the round head of various objects.
Figure 2: Method overview. 3.1) Our dataset. Each Pointwise Feature Vector (PFV) in a hidden layer is given a meaning by labeling it with its Effective Receptive Field (ERF). The ERF image (the blue area in the picture) uses a single color to represent importance, which makes it difficult to interpret. Therefore, the context around the ERF is added to make the significance of the ERF easier to understand. 3.2) Layer-wise concept extraction. 3.2.1) The PFV vector space has non-uniform density: it is dense around specific concepts and sparse elsewhere. Hence, bisecting clustering, which suits such data structures, is employed to extract concept vectors. The meaning of each concept vector is then explained through the sample with the highest cosine similarity to the concept vector. 3.2.2) The PFV and embeddings in a layer are reconstructed from the extracted concept vectors. 3.3) Inter-layer concept attribution, employing Generalized Integrated Gradients (GIG).
Figure 3: Left: Causal explanation graph of 'Foxhound'. From top to bottom, [Classifier, Layer4.2, Layer3.5, Layer3.2, Layer3.0, Layer2.3, Layer1.2], the bottleneck blocks in ResNet50. Our method can also provide a dataset-wide explanation of a single image. Right: Detailed concept visualization of the colored boxes: 1) green, 2) blue, 3) purple, and 4) brown. ① Concept 3,545 "Dog Body" at the Layer4.2 block. ② Concept 5,464 "Dog Leg" at the Layer3.5 block. ③ Concept 1,838 "Rounded Cone" at the Layer3.2 block. Judging from its top-1 representation image ('folded arm'), the concept seems irrelevant to the input image. However, the concept localization and the top-10 nearest embeddings show that Concept 1,838 represents "Rounded Cone". ④ Concept 1,191 "Eye" at the Layer2.3 block. Best viewed when enlarged.
Figure 4: Validation of Concept Extraction. Top Left: Comparison of C-Insertion and C-Deletion curves for three concept extraction methods applied to the first block of the first stage of ResNet50 (Layer1.0). In C-Insertion, 'Ours' achieves the highest final score, leading to the highest AUC. In C-Deletion, 'Ours' degrades slowly because it starts from the highest initial prediction score. Top Right: AUC differences across block numbers, used for a balanced comparison because better insertion performance tends to come with worse deletion performance. 'Ours' generally outperforms the other methods across various blocks. Bottom: Top-3 most important concepts found by the Sparse AutoEncoder (SAE) and by 'Ours' for classifying a grasshopper image at Layer4.0. Even though SAE outperforms our method in AUC difference on later layers, the concepts extracted by SAE seem less persuasive than those from 'Ours'.
Figure 5: Left: Deletion (top row) and Insertion (bottom row) scores across consecutive layers in ResNet50. The blue curves represent our method (GIG), while the green curves denote random attribution. Our method consistently outperforms random attribution, as indicated by the significantly steeper decline in deletion scores and the sharper rise in insertion scores. Right: AUC difference by layer transition. It quantifies this advantage: our method achieves substantially higher AUC differences across all layer transitions.
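
The deletion and insertion scores reported in Figures 4 and 5 can be illustrated with a small sketch. The paper's exact protocol is not given in this section, so the following makes several assumptions: concepts are removed or added one at a time in ranked order, the AUC uses the trapezoidal rule over evenly spaced steps, and "AUC difference" means insertion AUC minus deletion AUC. The score function and all names are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set

ScoreFn = Callable[[Set[int]], float]


def trapezoid_auc(scores: Sequence[float]) -> float:
    """Area under a curve sampled at evenly spaced points on [0, 1]."""
    intervals = len(scores) - 1
    if intervals < 1:
        raise ValueError("need at least two points to compute an AUC")
    area = sum((scores[i] + scores[i + 1]) / 2.0 for i in range(intervals))
    return area / intervals


def deletion_curve(score_fn: ScoreFn, ranking: Sequence[int]) -> List[float]:
    """Start with every concept active and remove them in ranked order."""
    active = set(ranking)
    curve = [score_fn(active)]
    for concept in ranking:
        active.discard(concept)
        curve.append(score_fn(active))
    return curve


def insertion_curve(score_fn: ScoreFn, ranking: Sequence[int]) -> List[float]:
    """Start with no concept active and add them in ranked order."""
    active: Set[int] = set()
    curve = [score_fn(active)]
    for concept in ranking:
        active.add(concept)
        curve.append(score_fn(active))
    return curve


def auc_difference(score_fn: ScoreFn, ranking: Sequence[int]) -> float:
    """Insertion AUC minus deletion AUC (sign convention is an assumption)."""
    insertion = trapezoid_auc(insertion_curve(score_fn, ranking))
    deletion = trapezoid_auc(deletion_curve(score_fn, ranking))
    return insertion - deletion


# Hypothetical stand-in for a model: the prediction score is the summed
# weight of the concepts that are still active.
TOY_WEIGHTS: Dict[int, float] = {0: 0.5, 1: 0.3, 2: 0.2}


def toy_score(active: Set[int]) -> float:
    return sum(TOY_WEIGHTS[c] for c in active)


good_ranking = [0, 1, 2]  # most important concept first
bad_ranking = [2, 1, 0]   # least important concept first
good_diff = auc_difference(toy_score, good_ranking)
bad_diff = auc_difference(toy_score, bad_ranking)
```

Under these assumptions, a ranking that places the truly important concepts first yields a fast-rising insertion curve and a fast-falling deletion curve, so `good_diff` is positive while `bad_diff` is negative. Combining both curves into one number avoids rewarding a method that does well on only one of the two.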
