Towards Combinatorial Interpretability of Neural Computation

Micah Adler; Dan Alistarh; Nir Shavit

TL;DR

The paper introduces combinatorial interpretability, an approach to understanding neural computation that analyzes sign-based weight structures, and uses the feature channel coding (FCC) hypothesis to reveal exact, static mechanisms by which networks compute Boolean functions. By decomposing weight matrices and analyzing feature codes, the authors demonstrate that networks trained with gradient descent develop identifiable feature channel codes, so the circuit can be extracted and its computation decoded without examining activations or training auxiliary autoencoders. They quantify how coding capacity governs scaling laws, show consistent patterns across DNFs, CNFs, and a one-dimensional vision task, and propose a cascading disentanglement framework to extend the analysis to deeper networks. The approach offers a mechanistic path to understanding neural computation and may provide a foundation for understanding both artificial and biological circuits, with implications for research on scalability, sparsity, and interpretability. FCC takes a code-centric view of neural computation that complements geometry-based perspectives and supplies concrete tools for static, weight-based mechanistic interpretation.

Abstract

We introduce combinatorial interpretability, a methodology for understanding neural computation by analyzing the combinatorial structures in the sign-based categorization of a network's weights and biases. We demonstrate its power through feature channel coding, a theory that explains how neural networks compute Boolean expressions and potentially underlies other categories of neural network computation. According to this theory, features are computed via feature channels: unique cross-neuron encodings shared among the inputs the feature operates on. Because different feature channels share neurons, the neurons are polysemantic and the channels interfere with one another, making the computation appear inscrutable. We show how to decipher these computations by analyzing a network's feature channel coding, offering complete mechanistic interpretations of several small neural networks that were trained with gradient descent. Crucially, this is achieved via static combinatorial analysis of the weight matrices, without examining activations or training new autoencoding networks. Feature channel coding reframes the superposition hypothesis, shifting the focus from neuron activation directionality in high-dimensional space to the combinatorial structure of codes. It also allows us, for the first time, to quantify exactly and explain the relationship between a network's parameter size and its computational capacity (i.e., the set of features it can compute with low error), a relationship that is implicitly at the core of many modern scaling laws. Though our initial studies of feature channel coding are restricted to Boolean functions, we believe they provide a rich, controlled, and informative research space, and that the path we propose for combinatorial interpretation of neural computation can provide a basis for understanding both artificial and biological neural circuits.
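
As a rough illustration of what a static, sign-based reading of a weight matrix could look like, the sketch below maps a toy first-layer weight matrix to a pattern of +1, -1, and 0 entries and then lists the neurons on which two inputs both carry a positive sign. It is not the paper's procedure: the threshold `eps`, the matrix `W`, and the helper names `sign_pattern` and `shared_positive_neurons` are all assumptions made for this example, and treating "shared positive neurons" as a candidate channel for a pair of inputs is a simplification of the idea described above.

```python
# Hypothetical illustration only; names, threshold, and data are assumptions,
# not taken from the paper.

def sign_pattern(weights, eps=0.1):
    """Map each weight to +1, -1, or 0, treating |w| <= eps as zero."""
    return [[0 if abs(w) <= eps else (1 if w > 0 else -1) for w in row]
            for row in weights]

def shared_positive_neurons(signs, i, j):
    """Return indices of neurons (rows) whose sign is +1 for both inputs i and j."""
    return [n for n, row in enumerate(signs) if row[i] == 1 and row[j] == 1]

# Toy first-layer weights: 4 neurons (rows) x 3 inputs (columns).
W = [
    [ 0.9,  1.1,  -0.8],
    [ 1.0,  0.02,  0.7],
    [ 0.8,  0.9,   0.05],
    [-0.9,  0.03,  1.2],
]

S = sign_pattern(W)

# Neurons positively weighted by both inputs of a pair.
channel_01 = shared_positive_neurons(S, 0, 1)
channel_02 = shared_positive_neurons(S, 0, 2)

print(S)
print(channel_01, channel_02)
```

In this toy matrix, input 0 has a positive sign on three neurons, which it shares with different partners. That loosely mirrors the polysemantic neurons the abstract describes, and the whole reading uses only the weights, with no activations involved.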
