Evaluating Brain-Inspired Modular Training in Automated Circuit Discovery for Mechanistic Interpretability

Jatin Nainani

TL;DR

The paper tackles the challenge of interpretability in large neural networks by focusing on mechanistic interpretability via circuit discovery. It evaluates Brain-Inspired Modular Training (BIMT) as a training regime designed to promote modularity and thus facilitate automated circuit discovery. Through MNIST-based empirical studies with recursive activation patching, BIMT achieves lower circuit-logit gaps, faster discovery times, and higher circuit sparsity than competing regimes, at the cost of higher training memory (largely due to neuron swaps). The findings support BIMT as a practical approach to scalable mechanistic interpretability for large models, offering concrete benefits for rapid, reliable circuit discovery and analysis, with clear trade-offs in training resources and modest inference overhead.

Abstract

Large Language Models (LLMs) have experienced a rapid rise in AI, changing a wide range of applications with their advanced capabilities. As these models become increasingly integral to decision-making, the need for thorough interpretability has never been more critical. Mechanistic Interpretability offers a pathway to this understanding by identifying and analyzing specific sub-networks or 'circuits' within these complex systems. A crucial aspect of this approach is Automated Circuit Discovery, which facilitates the study of large models like GPT4 or LLAMA in a feasible manner. In this context, our research evaluates a recent method, Brain-Inspired Modular Training (BIMT), designed to enhance the interpretability of neural networks. We demonstrate how BIMT significantly improves the efficiency and quality of Automated Circuit Discovery, overcoming the limitations of manual methods. Our comparative analysis further reveals that BIMT outperforms existing models in terms of circuit quality, discovery time, and sparsity. Additionally, we provide a comprehensive computational analysis of BIMT, including aspects such as training duration, memory allocation requirements, and inference speed. This study advances the larger objective of creating trustworthy and transparent AI systems in addition to demonstrating how well BIMT works to make neural networks easier to understand.

Paper Structure (29 sections, 2 equations, 7 figures, 3 tables)

Introduction
Research Design
BIMT
Neural Networks
Research questions
Research Question 1: How do networks trained to be modular affect automatic circuit discovery?
Research Question 2: How does the computational efficiency vary with respect to modularity?
General procedure for Mechanistic Interpretability research
Quantifying Interpretability
Recursive Activation Patching
Logit Difference
Other computational metrics
Empirical Procedure
Data and Task for Original Network
Task for Circuit
...and 14 more sections

Figures (7)

Figure 1: Demonstration of Activation Patching on Modular Trained Network. A clean input is given to the model to produce the expected behavior (detection of digit 8). The model is then given a corrupted input (digit 3), which leads to a corrupted output. Blue and red weight connections correspond to the clean run, whereas green and yellow ones correspond to the corrupted run. We then iteratively copy activations from the clean network to the corrupted network. If replacing activation $i$ on layer $j$ reduces the difference between $L_{\text{clean},ij}$ and $L_{\text{corr},ij}$, then neuron $[i,j]$ is a relevant part of the circuit for this task. The figure shows neuron 47 at layer 1 being patched into the corrupted run, represented by the blue and red lines at that neuron. The patched layer then propagates as normal and produces a different logit output.
Figure 2: Comparison between the original network and the discovered circuits for the BIMT and L1 Only models. At inference, the BIMT circuit produces significantly lower logit differences. Red edges denote positive edges, while blue edges denote negative ones. The discovered circuits also display modularity in the case of BIMT.
Figure 3: Sparsity of the discovered circuit versus average logit difference. Each data point represents a model under comparison. The gray vertical lines represent the 95% confidence interval calculated by bootstrapping. Models in the bottom right are the most valuable, as they combine a low average logit difference (showing that the behavior is present) with high circuit sparsity (avoiding redundancy in circuits). BIMT consistently provides the highest sparsity and the lowest logit differences for the task of circle detection.
Figure 4: Sparsity of the discovered circuit versus time taken for discovery. Each data point represents a model under comparison. The gray vertical lines represent the 95% confidence interval calculated by bootstrapping. BIMT consistently yields sparser circuits with the lowest mean discovery time. Because the BIMT-trained network is sparser to begin with, discovery takes less time.
Figure 5: Memory allocated to CUDA GPU during training. Models with "swap" require almost 1.5 times the memory to train in comparison to other models.
...and 2 more figures
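
The patching step described in the Figure 1 caption can be illustrated with a small sketch. This is not the paper's recursive procedure or its MNIST network: it assumes a toy one-hidden-layer ReLU network with hand-picked weights, patches one hidden neuron at a time, and scores each neuron by how much the patch closes the gap between the clean and corrupted logits for a target class. The names `forward`, `patching_scores`, and the threshold of `0.5` are hypothetical choices made for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # W is a list of rows; returns the matrix-vector product W @ v.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def forward(W1, W2, x, patch=None):
    """Run a one-hidden-layer MLP.

    `patch` optionally maps a hidden-neuron index to an activation value
    that overwrites the value computed from `x`.
    """
    hidden = relu(matvec(W1, x))
    if patch:
        for idx, value in patch.items():
            hidden[idx] = value
    logits = matvec(W2, hidden)
    return hidden, logits

def patching_scores(W1, W2, x_clean, x_corr, target):
    """Score each hidden neuron by how much patching closes the logit gap."""
    h_clean, out_clean = forward(W1, W2, x_clean)
    _, out_corr = forward(W1, W2, x_corr)
    # Gap between clean and corrupted runs before any patching.
    base_gap = abs(out_clean[target] - out_corr[target])
    scores = []
    for i, clean_act in enumerate(h_clean):
        # Copy one clean activation into the corrupted run.
        _, out_patched = forward(W1, W2, x_corr, patch={i: clean_act})
        gap = abs(out_clean[target] - out_patched[target])
        scores.append(base_gap - gap)  # positive: the patch restores behavior
    return base_gap, scores

# Toy weights (assumed, not from the paper): 2 inputs, 3 hidden neurons, 2 logits.
W1 = [[1, 0], [0, 1], [1, 1]]
W2 = [[1, 0, 0], [0, 1, 0]]
x_clean, x_corr = [1, 0], [0, 1]

base_gap, scores = patching_scores(W1, W2, x_clean, x_corr, target=0)
# Keep neurons whose patch closes at least half of the original gap.
circuit = [i for i, s in enumerate(scores) if s >= 0.5 * base_gap]
print(circuit)  # -> [0]
```

In this toy case only hidden neuron 0 feeds the target logit, so it is the only neuron whose clean activation restores the clean output. A recursive variant would repeat the same test layer by layer, starting from the neurons already found relevant.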
