IBCircuit: Towards Holistic Circuit Discovery with Information Bottleneck

Tian Bian; Yifan Niu; Chaohao Yuan; Chengzhi Piao; Bingzhe Wu; Long-Kai Huang; Yu Rong; Tingyang Xu; Hong Cheng; Jia Li

TL;DR

IBCircuit is an optimization framework for holistic circuit discovery. It can be applied to any given task without the tedious design of task-specific corrupted activations, and it identifies more faithful and minimal circuits, in terms of critical node and edge components, than recent related work.

Abstract

Circuit discovery has recently attracted attention as a promising research direction for explaining the non-trivial behaviors of language models. It aims to find the computational subgraphs within the model, also known as circuits, that are responsible for solving specific tasks. However, most existing studies overlook the holistic nature of these circuits and require specific corrupted activations to be designed for each task, an approach that is both inaccurate and inefficient. In this work, we propose IBCircuit, an end-to-end approach based on the Information Bottleneck principle, to identify informative circuits holistically. IBCircuit is an optimization framework for holistic circuit discovery and can be applied to any given task without the tedious design of corrupted activations. On both the Indirect Object Identification (IOI) and Greater-Than tasks, IBCircuit identifies more faithful and minimal circuits, in terms of critical node and edge components, than recent related work.

Paper Structure (30 sections, 2 theorems, 27 equations, 9 figures)

Introduction
Related Work
Circuit Analysis
Information Bottleneck
Preliminaries
Neural Circuits
Information Bottleneck
IBCircuit
Intuition: Informative Circuit
Estimation of Mutual Information
Circuit Parameterization
Objective for Training
Circuit Formation
Experiments
Experiment Setting
...and 15 more sections

Key Result

Proposition 4.1

For the output $Y$ of the original transformer language model $\mathcal{G}$ and the output $Y_{\mathcal{C}}$ of a given circuit $\mathcal{C}$, the variational lower bound of $I(Y ; \mathcal{C})$ can be written in terms of $D_{KL}( \cdot || \cdot)$, the Kullback-Leibler divergence, and $H(\cdot)$, the entropy.
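
For orientation, lower bounds on mutual information of this kind usually follow the standard variational argument sketched below. This is the textbook derivation, not necessarily the paper's exact statement; the variational distribution $q(Y \mid \mathcal{C})$ is an assumption introduced here for illustration.

$$
\begin{aligned}
I(Y ; \mathcal{C}) &= H(Y) - H(Y \mid \mathcal{C}) \\
&= H(Y) + \mathbb{E}_{p(Y, \mathcal{C})}\left[\log p(Y \mid \mathcal{C})\right] \\
&= H(Y) + \mathbb{E}_{p(Y, \mathcal{C})}\left[\log q(Y \mid \mathcal{C})\right] + \mathbb{E}_{p(\mathcal{C})}\left[D_{KL}\big(p(Y \mid \mathcal{C}) \,||\, q(Y \mid \mathcal{C})\big)\right] \\
&\geq H(Y) + \mathbb{E}_{p(Y, \mathcal{C})}\left[\log q(Y \mid \mathcal{C})\right].
\end{aligned}
$$

The inequality holds because the Kullback-Leibler divergence is non-negative, so maximizing the right-hand side over $q$ tightens the bound.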

Figures (9)

Figure 1: (a) Transformer blocks from the perspective of the residual stream. (b) Adding Gaussian noise to the activations of attention heads using node-wise IB weights and optimizing through the Information Bottleneck. (c) Selecting attention heads with less noise as the node-level circuit. (d) Adding Gaussian noise to the activations of source nodes using edge-wise IB weights and optimizing through the Information Bottleneck. (e) Selecting edges with less noise as the edge-level circuit.
Figure 2: ROC curves of SP, ACDC, AP, and IBCircuit for recovering the model components identified in previous work, on the IOI circuit and the Greater-Than circuit.
Figure 3: Comparison of the impact of different trade-off coefficients $\alpha$ on IBCircuit for the IOI task.
Figure 4: Comparison of IBCircuit and ablation variants in terms of Greater Probability and KL Divergence metrics.
Figure 5: Comparison of IBCircuit and related methods in terms of Logit Difference and Greater Probability metrics under different node number thresholds. Higher metric scores and fewer nodes correspond to better circuits.
...and 4 more figures
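
The node-level steps described in Figure 1 (b) and (c) can be illustrated with a minimal NumPy sketch. Several details here are assumptions rather than the authors' implementation: the function names `inject_noise` and `select_heads` are hypothetical, the IB weight of each head is taken to be a sigmoid of a learnable logit, the noise is drawn from a Gaussian matching each head's batch statistics, and the circuit is formed by keeping the `k` heads with the largest weights. The optimization of the Information Bottleneck objective itself is omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def inject_noise(head_acts, ib_logits, rng):
    """Mix each attention head's activation with Gaussian noise.

    head_acts: array of shape (batch, n_heads, d_head)
    ib_logits: array of shape (n_heads,), one (assumed) learnable logit per head
    A weight near 1 keeps the head's signal; a weight near 0 replaces it with noise.
    """
    lam = sigmoid(ib_logits)[None, :, None]           # node-wise IB weights
    mu = head_acts.mean(axis=0, keepdims=True)        # per-head batch mean
    sigma = head_acts.std(axis=0, keepdims=True)      # per-head batch std
    eps = mu + sigma * rng.standard_normal(head_acts.shape)
    return lam * head_acts + (1.0 - lam) * eps

def select_heads(ib_logits, k):
    """Return the indices of the k heads that receive the least noise."""
    lam = sigmoid(ib_logits)
    return sorted(np.argsort(-lam)[:k].tolist())

rng = np.random.default_rng(0)
acts = rng.standard_normal((8, 4, 16))                # toy activations: 4 heads
logits = np.array([4.0, -4.0, 0.5, -0.5])             # stand-in for trained logits
noisy = inject_noise(acts, logits, rng)
circuit = select_heads(logits, k=2)
print(circuit)  # [0, 2]
```

In this toy setup head 0 is almost untouched while head 1 is almost entirely replaced by noise, so a thresholding step keeps heads 0 and 2. The edge-level variant in Figure 1 (d) and (e) would apply the same mixing with one weight per edge instead of one per head.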

Theorems & Definitions (3)

Proposition 4.1: Variational lower bound of $I(Y ; \mathcal{C})$
Proposition 4.2: Variational upper bound of $I(\mathcal{G} ; \mathcal{C})$
proof
