Causal Interpretation of Neural Network Computations with Contribution Decomposition

Joshua Brendan Melander; Zaki Alaoui; Shenghua Liu; Surya Ganguli; Stephen A. Baccus

TL;DR

This work introduces CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone.

Abstract

Understanding how neural networks transform inputs into outputs is crucial for interpreting and manipulating their behavior. Most existing approaches analyze internal representations by identifying hidden-layer activation patterns correlated with human-interpretable concepts. Here we take a direct approach to examine how hidden neurons act to drive network outputs. We introduce CODEC (Contribution Decomposition), a method that uses sparse autoencoders to decompose network behavior into sparse motifs of hidden-neuron contributions, revealing causal processes that cannot be determined by analyzing activations alone. Applying CODEC to benchmark image-classification networks, we find that contributions grow in sparsity and dimensionality across layers and, unexpectedly, that they progressively decorrelate positive and negative effects on network outputs. We further show that decomposing contributions into sparse modes enables greater control and interpretation of intermediate layers, supporting both causal manipulations of network output and human-interpretable visualizations of distinct image components that combine to drive that output. Finally, by analyzing state-of-the-art models of neural activity in the vertebrate retina, we demonstrate that CODEC uncovers combinatorial actions of model interneurons and identifies the sources of dynamic receptive fields. Overall, CODEC provides a rich and interpretable framework for understanding how nonlinear computations evolve across hierarchical layers, establishing contribution modes as an informative unit of analysis for mechanistic insights into artificial neural networks.
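The distinction between a hidden unit's activation and its contribution can be illustrated on a toy network. The sketch below is not CODEC itself: it assumes a gradient-times-activation attribution (one common gradient-based choice) on a hypothetical two-layer ReLU network with a linear readout, and the names `hidden_contributions`, `W1`, and `w2` are illustrative only.

```python
import numpy as np

def hidden_contributions(W1, w2, x):
    """Return hidden activations and their gradient-times-activation
    contributions to the scalar output y = w2 . relu(W1 @ x).
    Assumption: this attribution rule is a stand-in, not the paper's exact method."""
    h = np.maximum(W1 @ x, 0.0)   # hidden activations (ReLU)
    grad = w2                     # dy/dh for a linear readout
    return h, h * grad            # contribution of unit i = h_i * dy/dh_i

# Hypothetical weights and input chosen for illustration.
W1 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [1.0, 1.0]])
w2 = np.array([2.0, 0.0, -1.0])
x = np.array([1.0, 3.0])

h, c = hidden_contributions(W1, w2, x)
y = float(w2 @ h)
print("activations:  ", h)
print("contributions:", c)
print("output:", y, "sum of contributions:", c.sum())
```

In this toy case the second unit is strongly active yet contributes nothing to the output, and the most active unit pushes the output down, which is the kind of information that activations alone do not carry. Because the readout is linear, the contributions sum exactly to the output; that property need not hold for deeper nonlinear networks.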

Paper Structure (30 sections, 18 equations, 19 figures)

A framework for understanding biological and artificial neural networks
Measuring contributions of hidden-layer neurons
Layerwise evolution of neural contributions in CNNs
Decomposing contributions into computational modes
Controlling network behavior using contribution modes
Visualizing inputs that cause hidden contributions
Interpreting biological neural network models with CODEC
CODEC on Vision Transformers
Conclusion
Supplemental Material
Spatial aggregation and E/I separation
Correlation analysis with semantic concepts
Robustness to SAE Hyperparameters
Runtime measurements and Complexity
CODEC on ViTs
...and 15 more sections

Figures (19)

Figure 1: Understanding the contribution of an intermediate component to downstream computation. (A) Biological and (B) artificial neural circuits construct computations by combining sets of upstream components in an input-dependent manner. The action of a network component $Z$ is a composition of its receptive field, or sensitivity to input $X$, and its projective field, or effect on output $Y$. Measuring both is required to explain how the intermediate component contributes to the overall behavior of the system.
Figure 2: Hidden-neuron contributions in a deep convolutional network. (A) Pipeline of computing contributions for an image processed through ResNet-50. Gradient-based attribution methods are extended to compute the contributions of each hidden unit to scalar targets of network output such as entropy of logits, sum of top-$k$ logits, and individual class logits. (B) Spatial map of activations and contributions of a single channel in Layer 5. (C) Mean positive, negative and net contribution for each channel. (D--E) Same as (B--C) for Layer 8.
Figure 3: Channel contributions become sparser, more single-signed, and higher dimensional through the network. (A) Example matrix of spatially-summed contributions and activations from one layer for all channels and images from four classes. (B) Hoyer sparsity index for contributions and activations across network layers. (C) Scatter plot of mean negative vs. mean positive contributions of each channel to network output. (D) Correlation coefficient between positive and negative contributions of individual channels across network depth. (E) Fraction of explained variance (FOEV) across all class-averaged channel weightings for layers 2 and 14. (F) Number of components required to reach 95% FOEV.
Figure 4: Sparse autoencoder decomposition of network contributions. (A) Schematic of contribution decomposition. Channel contributions are spatially summed, and an autoencoder is trained to reconstruct the contributions-by-images matrix by creating modes with sparse loadings (the weighting of each mode for a particular image). Loadings are passed through a sigmoid and then a threshold, and are regularized to encourage sparsity. (B) Loadings from the mode maximally correlated with the class "panda" for contributions (top, black) and activations (bottom, blue). Inset shows the loadings for 50 panda images and 100 images of other classes.
Figure 5: Emergence of meaningful contribution modes in intermediate layers. (A) Histograms of each mode's maximum correlation with binary class indicators for contributions (grey) and activations (blue) at hidden layer 13. (B) Same as (A) for the correlation of individual channel contributions or activations with class indicators. (C) Number of modes with a correlation to a class of greater than 0.2. (D) Mean of the maximal class-correlation as a function of layer.
...and 14 more figures
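
Two of the summary statistics named in the Figure 3 caption can be sketched in a few lines. The code below uses the standard Hoyer sparsity definition, $(\sqrt{n} - \|v\|_1 / \|v\|_2) / (\sqrt{n} - 1)$, and assumes that the number of components needed to reach a given FOEV is obtained from a PCA of the mean-centered matrix; the paper's exact preprocessing is not given in this excerpt, and the function names are hypothetical.

```python
import numpy as np

def hoyer_sparsity(v):
    """Hoyer sparsity index: 0 for a uniform vector, 1 for a one-hot vector."""
    v = np.abs(np.asarray(v, dtype=float)).ravel()
    n = v.size
    l2 = np.sqrt((v ** 2).sum())
    if n < 2 or l2 == 0.0:
        return 0.0  # index is undefined here; 0.0 is a convention of this sketch
    return float((np.sqrt(n) - v.sum() / l2) / (np.sqrt(n) - 1.0))

def n_components_for_foev(M, threshold=0.95):
    """Smallest number of principal components whose cumulative fraction of
    explained variance reaches `threshold` (assumed PCA on column-centered M)."""
    M = np.asarray(M, dtype=float)
    centered = M - M.mean(axis=0, keepdims=True)
    s = np.linalg.svd(centered, compute_uv=False)  # singular values, descending
    var = s ** 2
    total = var.sum()
    if total == 0.0:
        return 0
    foev = np.cumsum(var) / total
    k = int(np.searchsorted(foev, threshold)) + 1
    return min(k, foev.size)

print(hoyer_sparsity([0, 0, 0, 5]))   # one-hot vector
print(hoyer_sparsity([1, 1, 1, 1]))   # uniform vector
rank_one = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0])
print(n_components_for_foev(rank_one))
```

A higher Hoyer index means that a few channels carry most of the magnitude, while a larger component count at a fixed FOEV threshold indicates a higher-dimensional set of contributions.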

Theorems & Definitions (1)

Definition 1: Complete input space decomposition of hidden units
