On the limits of neural network explainability via descrambling

Shashank Sule; Richard G. Spencer; Wojciech Czaja

TL;DR

The paper investigates neural network descrambling as a rigorous method for explaining trained weights by applying descramblers to layer preactivations to minimize an explainability loss. It builds a theoretical bridge between the smoothness-based descrambling objective and the Brockett cost, showing that, in the large-data limit, the optimal descramblers $P$ converge to $T U^{\top}$, where $T$ and $U$ come from the eigen- and singular-value decompositions of relevant matrices. The authors develop results for general settings and specialized cases, including isotropic data, SMG linear networks, and CNNs, and demonstrate that eigendecompositions can recover interpretable motifs such as notch filters and Chebyshev bases, aligning with practical observations in DEERNet experiments. They also discuss non-uniqueness of descramblers and propose practical computation strategies, highlighting the potential of SVD-based explainability for operator-learning and physics-informed NNs. Overall, the work advances a spectral, mathematically grounded view of NN explainability and identifies concrete directions for extending the framework to broader loss functions and symmetry groups.
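
As a rough numerical illustration of the limit $T U^{\top}$ stated above, the following NumPy sketch assumes the objective is the Brockett-type cost $\mathrm{Tr}(A P S S^{\top} P^{\top})$ over $P \in O(d)$, with $A = T \Omega T^{\top}$ sorted in ascending order and $S = U \Sigma V^{\top}$ sorted in descending order. The helper names (`brockett_cost`, `spectral_descrambler`), the random test matrices, and the exact form of the cost are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def brockett_cost(A, P, S):
    """Brockett-type cost Tr(A P S S^T P^T); equals ||D P S||_F^2 when A = D^T D."""
    PS = P @ S
    return float(np.trace(A @ PS @ PS.T))

def spectral_descrambler(A, S):
    """Return P = T U^T, plus the eigenvalues of A and singular values of S."""
    omega, T = np.linalg.eigh(A)                          # eigenvalues ascending
    U, sigma, _ = np.linalg.svd(S, full_matrices=False)   # singular values descending
    return T @ U.T, omega, sigma

rng = np.random.default_rng(0)
d, n = 6, 200
M = rng.standard_normal((d, d))
A = M + M.T                        # hypothetical symmetric matrix
S = rng.standard_normal((d, n))    # hypothetical preactivation data

P_hat, omega, sigma = spectral_descrambler(A, S)
best = brockett_cost(A, P_hat, S)

# Compare against randomly drawn orthogonal matrices (QR of Gaussian matrices).
random_costs = []
for _ in range(200):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    random_costs.append(brockett_cost(A, Q, S))

print("cost at T U^T:      ", best)
print("best random O(d):   ", min(random_costs))
```

Under these assumptions the cost at $T U^{\top}$ equals $\sum_i \omega_i \sigma_i^2$, which pairs the smallest eigenvalues of $A$ with the largest singular values of $S$, so no sampled orthogonal matrix should achieve a lower value.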

Abstract

We characterize the exact solutions to neural network descrambling--a mathematical model for explaining the fully connected layers of trained neural networks (NNs). By reformulating the problem as the minimization of the Brockett function arising in graph matching and complexity theory, we show that the principal components of the hidden layer preactivations can be characterized as the optimal explainers or descramblers for the layer weights, leading to descrambled weight matrices. We show that in typical deep learning contexts these descramblers take diverse and interesting forms including (1) matching largest principal components with the lowest frequency modes of the Fourier basis for isotropic hidden data, (2) discovering the semantic development in two-layer linear NNs for signal recovery problems, and (3) explaining CNNs by optimally permuting the neurons. Our numerical experiments indicate that the eigendecompositions of the hidden layer data--now understood as the descramblers--can also reveal the layer's underlying transformation. These results illustrate that the SVD is more directly related to the explainability of NNs than previously thought and offers a promising avenue for discovering interpretable motifs for the hidden action of NNs, especially in contexts of operator learning or physics-informed NNs, where the input/output data has limited human readability.
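
One way to see why low-frequency Fourier modes can appear in item (1) is sketched below. It assumes the smoothness penalty is built from a periodic first-difference operator $D$, which the abstract does not specify. In that case $D^{\top} D$ is circulant, its eigenvectors are discrete Fourier modes, and its eigenvalues $2 - 2\cos(2\pi k/d)$ grow with the frequency index $k$, so the smallest eigenvalues correspond to the lowest frequencies. The dimension `d` and all variable names are illustrative.

```python
import numpy as np

d = 16
shift = np.roll(np.eye(d), 1, axis=0)   # cyclic shift (permutation) matrix
D = np.eye(d) - shift                   # assumed periodic first-difference operator
A = D.T @ D                             # equals 2I - shift - shift^T (circulant)

omega, T = np.linalg.eigh(A)            # eigenvalues ascending, eigenvectors in columns

# Closed-form eigenvalues of this circulant matrix, sorted for comparison.
k = np.arange(d)
expected = np.sort(2.0 - 2.0 * np.cos(2.0 * np.pi * k / d))

# Dominant non-negative frequency of each eigenvector, read off from its FFT.
spectrum = np.abs(np.fft.fft(T, axis=0))[: d // 2 + 1]
dominant = np.argmax(spectrum, axis=0)

print("eigenvalues match closed form:", np.allclose(omega, expected, atol=1e-10))
print("dominant frequency per eigenvector:", dominant)
```

With the eigenvalues in ascending order, the dominant frequency of the eigenvectors is non-decreasing, starting from the constant mode. A descrambler of the form $T U^{\top}$ would then send the largest principal components to these lowest-frequency columns of $T$.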

Paper Structure (13 sections, 7 theorems, 34 equations, 4 figures, 1 table)

This paper contains 13 sections, 7 theorems, 34 equations, 4 figures, 1 table.

Introduction
Neural Network Descrambling
Goals of this work and summary of results
Results
Neural network descrambling: general results
Variations in input distribution and network architecture
NN descrambling in the presence of training
Descrambling as rearrangement of neurons
Numerical results
Non-uniqueness of descramblers
Assessing the validity of interpretation
Discovering motifs within singular vectors of higher layers
Conclusion

Key Result

Lemma 1

Fix $S \in \mathbb{R}^{d \times n}$ and $A \in \mathbb{R}^{d \times d}$ such that $A^{\top} = A$. Define $\widehat{P}$ to be the solution to the following minimization problem over the orthogonal group $O(d)$: Let $A = T \Omega T^{\top}$ be an eigendecomposition of $A$, sorted in ascending order of the eigenvalues, where $\Omega$ is the diagonal matrix of eigenvalues of $A$ given by $\{\omega_{i}\}_{

Figures (4)

Figure 1: Left to right: The descramblers $\widehat{P}_{SC}(1,X,N)$ were computed using four different strategies: (a) Projected Gradient Descent (PGD) on the Cayley gradient flow equation with $P(Q_0) = I$, (b) warm start, i.e., PGD with $P(Q_0) = T$, (c) rescaled PGD, where $D^{\top}D = T\Omega T^{\top}$ was replaced with a diagonal matrix $\Omega$ and solutions were interpreted through multiplication by $T$, and (d) direct eigendecomposition of $S = f_1(X) = W_1X$.
Figure 2: Top: Descrambled weight matrices $\widehat{P}_{SC}(1,X,N)$ computed for the four strategies outlined in Figure 1. Notably, while the descramblers are quite different, the descrambled weights themselves are quite similar. Bottom: This similarity among explanations can be quantified by moving to the Fourier domain, where nearly all descrambled weights have a notch at the zero frequency and a bandpass filter along the output dimension in approximately the $\omega_1 \in [-10,10]$ frequency band.
Figure 3: Left: The 2D FFT of the singular vectors of the discretized integral kernel in the DEER equation. Right: The 2D FFT of the scaled right singular vectors of the trained weights. Note that the panel on the right resembles a noisier version of the left panel, suggesting that the first layer of the NN acts as a pseudo-inverse.
Figure 4: The top five principal components $\{u^{i}_{S}\}_{i=1}^{5}$ of the second layer preactivation data $S = W_2 f_1(x)$ resemble a system of orthogonal polynomials. In [amey2021neural], a powerful plausibility argument suggested that these can be fit by Chebyshev polynomials. Bottom right: After shifting and rescaling the principal components, the inner product matrix between the top 20 Chebyshev polynomials $C_n$ and the principal components $u^{i}_{S}$ is approximately banded.
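
The Figure 1 caption contrasts iterative strategies (cold and warm start) with a direct decomposition. The sketch below mimics that comparison on synthetic data under several assumptions. The cost is taken to be $\mathrm{Tr}(A P B P^{\top})$ with $A = D^{\top} D$ for a periodic first-difference $D$ and $B = S S^{\top} / n$. The projection onto $O(d)$ uses the polar factor from an SVD, which is a stand-in for the Cayley gradient flow used in the paper, not a reproduction of it. All function names, sizes, and the step-size rule are hypothetical.

```python
import numpy as np

def cost(A, B, P):
    """Brockett-type cost Tr(A P B P^T) for symmetric A and B."""
    return float(np.trace(A @ P @ B @ P.T))

def project_to_orthogonal(Y):
    """Closest orthogonal matrix to Y in Frobenius norm (polar factor via SVD)."""
    U, _, Vt = np.linalg.svd(Y)
    return U @ Vt

def pgd_descrambler(A, B, P0, n_steps=500):
    """Projected gradient descent on O(d); a stand-in for the paper's Cayley flow."""
    # Conservative step size based on the spectral norms of A and B.
    step = 0.25 / (np.linalg.norm(A, 2) * np.linalg.norm(B, 2))
    P = P0.copy()
    history = [cost(A, B, P)]
    for _ in range(n_steps):
        grad = 2.0 * A @ P @ B          # Euclidean gradient of Tr(A P B P^T)
        P = project_to_orthogonal(P - step * grad)
        history.append(cost(A, B, P))
    return P, history

rng = np.random.default_rng(1)
d, n = 8, 400
D = np.eye(d) - np.roll(np.eye(d), 1, axis=0)   # assumed periodic first difference
A = D.T @ D
W1 = rng.standard_normal((d, d))                # hypothetical first-layer weights
X = rng.standard_normal((d, n))                 # hypothetical input data
S = W1 @ X                                      # preactivations S = W_1 X
B = S @ S.T / n

omega, T = np.linalg.eigh(A)                          # ascending eigenvalues
U, sigma, _ = np.linalg.svd(S, full_matrices=False)   # descending singular values
optimum = float(np.sum(omega * sigma**2) / n)         # cost attained by T U^T

P_cold, hist_cold = pgd_descrambler(A, B, np.eye(d))  # strategy (a): start at I
P_warm, hist_warm = pgd_descrambler(A, B, T)          # strategy (b): start at T
P_direct = T @ U.T                                    # strategy (d): direct decomposition

print("optimum:", optimum)
print("cold start final cost:", hist_cold[-1])
print("warm start final cost:", hist_warm[-1])
print("direct cost:", cost(A, B, P_direct))
```

In this sketch the iterative runs can only approach the value reached by the direct decomposition from above, and different starting points may end at different orthogonal matrices, which is in line with the non-uniqueness noted in the Figure 2 caption.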

Theorems & Definitions (18)

Lemma 1
Remark 1
proof
Remark 2
Theorem 1
proof
Theorem 2: Isotropic data
proof
Remark 3
Remark 4
...and 8 more
