From Black-Box to White-Box: Control-Theoretic Neural Network Interpretability
Jihoon Moon
TL;DR
This work introduces a control-theoretic framework for mechanistic interpretability of feedforward neural networks: it treats a network as a nonlinear state-space system and linearizes it locally around an operating point. It defines input–state and hidden–output Jacobians, constructs static controllability and observability Gramians ($W_C$ and $W_O$), and analyzes a Hankel-like product $M = W_C W_O$ to extract Hankel singular values and internal modes that reveal dominant neural pathways. The approach yields neuron-level importance scores and shows how activation saturation or a shift in operating point reshapes internal controllability, observability, and mode structure, as demonstrated on small SwiGLU and GELU networks. The framework offers a principled, local white-box view that can be extended to larger architectures and transformers, and it provides a starting point for integrating system-theoretic tools with modern deep learning interpretability research.
Abstract
Deep neural networks achieve state-of-the-art performance but remain difficult to interpret mechanistically. In this work, we propose a control-theoretic framework that treats a trained neural network as a nonlinear state-space system and uses local linearization, controllability and observability Gramians, and Hankel singular values to analyze its internal computation. For a given input, we linearize the network around the corresponding hidden activation pattern and construct a state-space model whose state consists of hidden neuron activations. The input-state and state-output Jacobians define local controllability and observability Gramians, from which we compute Hankel singular values and associated modes. These quantities provide a principled notion of neuron and pathway importance: controllability measures how easily each neuron can be excited by input perturbations, observability measures how strongly each neuron influences the output, and Hankel singular values rank internal modes that carry input-output energy. We illustrate the framework on simple feedforward networks, including a 1-2-2-1 SwiGLU network and a 2-3-3-2 GELU network. By comparing different operating points, we show how activation saturation reduces controllability, shrinks the dominant Hankel singular value, and shifts the dominant internal mode to a different subset of neurons. The proposed method turns a neural network into a collection of local white-box dynamical models and suggests which internal directions are natural candidates for pruning or constraints to improve interpretability.
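The pipeline described in the abstract can be sketched on a toy example. The code below is a minimal illustration under stated assumptions, not the paper's implementation. It uses a hypothetical 2-3-2 network with a single GELU hidden layer and made-up weights (`W1`, `b1`, `W2`), and it assumes the static Gramians take the form $W_C = B B^\top$ and $W_O = C^\top C$, where $B$ is the input-state Jacobian and $C$ is the state-output Jacobian. It also assumes that the diagonals of $W_C$ and $W_O$ serve as per-neuron scores. Hankel singular values are taken as the square roots of the eigenvalues of $M = W_C W_O$.

```python
import math
import numpy as np

# Hypothetical 2-3-2 toy network; the weights are illustrative, not from the paper.
W1 = np.array([[1.0, 0.5], [-0.5, 1.0], [0.8, 0.8]])   # input -> hidden (3x2)
b1 = np.zeros(3)
W2 = np.array([[1.0, -1.0, 0.5], [0.3, 0.7, -0.2]])    # hidden -> output (2x3)

_erf = np.vectorize(math.erf)

def gelu_grad(z):
    """Derivative of the exact GELU: Phi(z) + z * phi(z)."""
    Phi = 0.5 * (1.0 + _erf(z / math.sqrt(2.0)))           # standard normal CDF
    phi = np.exp(-0.5 * z**2) / math.sqrt(2.0 * math.pi)   # standard normal PDF
    return Phi + z * phi

def local_analysis(x0):
    """Linearize around x0 and compute Gramians, Hankel singular values, modes."""
    z = W1 @ x0 + b1
    B = gelu_grad(z)[:, None] * W1      # input-state Jacobian dh/dx (3x2)
    C = W2                              # state-output Jacobian dy/dh (2x3)
    W_C = B @ B.T                       # assumed static controllability Gramian
    W_O = C.T @ C                       # assumed static observability Gramian
    M = W_C @ W_O                       # Hankel-like product
    eigvals, eigvecs = np.linalg.eig(M)
    order = np.argsort(eigvals.real)[::-1]
    # Clip tiny negative values caused by round-off before taking the square root.
    hsv = np.sqrt(np.clip(eigvals.real[order], 0.0, None))
    return {
        "B": B, "C": C, "W_C": W_C, "W_O": W_O,
        "hsv": hsv,
        "modes": eigvecs.real[:, order],       # columns are internal modes
        "controllability": np.diag(W_C),       # assumed per-neuron score
        "observability": np.diag(W_O),         # assumed per-neuron score
    }

near = local_analysis(np.array([0.0, 0.0]))    # pre-activations at zero
sat = local_analysis(np.array([-6.0, -6.0]))   # strongly negative pre-activations

print("dominant HSV near origin:", near["hsv"][0])
print("dominant HSV saturated:  ", sat["hsv"][0])
print("per-neuron controllability (saturated):", sat["controllability"])
```

In this single-hidden-layer setting, the nonzero Hankel singular values coincide with the singular values of the local input-output Jacobian $CB$, which gives a quick sanity check. Moving the operating point to where GELU is nearly flat shrinks the rows of $B$, so both the trace of $W_C$ and the dominant Hankel singular value drop, mirroring the saturation effect the abstract describes.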
