Adversarial Circuit Evaluation

Niels uit de Bos; Adrià Garriga-Alonso

TL;DR

This work asks whether circuits reported in the literature faithfully reflect the behavior of the full model, and answers by adversarially evaluating their worst-case performance. It formalizes a procedure based on resample ablation for computing the KL divergence between the full model's output and a circuit's output, and derives bounds on the number of samples needed for high-percentile guarantees. Applying the method to the IOI, docstring, and greater-than circuits, the authors show that the IOI and docstring circuits can deviate substantially from the full model even on benign inputs, while the greater-than circuit is more robust. The findings highlight the need for more robust, safety-oriented circuit design and suggest integrating adversarial evaluation into circuit discovery to improve both average and tail performance under distributional shifts.

Abstract

Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
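To make the metric concrete, here is a minimal sketch of comparing a model's output distribution with a circuit's output distribution and ranking inputs by divergence. The logits, input names, and the direction KL(model || circuit) are assumptions made for illustration; the abstract does not specify them, and the resample ablation step that would produce the circuit's output is not shown.

```python
import math

def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)

# Hypothetical (model_logits, circuit_logits) pairs over a toy 4-token vocabulary.
toy_outputs = {
    "input_a": ([2.0, 0.5, -1.0, 0.0], [2.1, 0.4, -1.0, 0.0]),
    "input_b": ([2.0, 0.5, -1.0, 0.0], [0.0, 2.5, -1.0, 0.0]),
}

# Assumed direction: KL(model || circuit).
divergences = {
    name: kl_divergence(softmax(model), softmax(circuit))
    for name, (model, circuit) in toy_outputs.items()
}

# Rank inputs from worst (largest divergence) to best.
worst_first = sorted(divergences, key=divergences.get, reverse=True)
```

In this toy data the circuit nearly matches the model on `input_a` but moves most of the probability mass to a different token on `input_b`, so `input_b` is ranked first as the worst-performing input.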

Paper Structure (16 sections, 2 theorems, 11 equations, 10 figures, 10 tables)

Introduction
Methodology
Resample ablation
Evaluation metrics
How many samples are needed?
Results
Comparing the KL divergence distributions
Docstring
Greater-than
IOI
Discussion
Future Work
Tables of worst-performing input points
Results with less adversarial patch inputs
Proof of Percentile Bounds
...and 1 more section

Key Result

Proposition 2.1

The probability $\Pr(\hat{x}_p \geq x_p)$ that $\hat{x}_p$ is an upper bound for the true $p$-th percentile $x_p$ of $X$ can be calculated as where $F_{\rm Binom}(x; n, p)$ is the cumulative distribution function of the binomial distribution with parameters $n$ and $p$.
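As an illustration of how a binomial CDF can yield such a probability, the sketch below works under assumptions that are not taken from the proposition itself: $X$ is continuous, $p$ is expressed as a fraction in $(0, 1)$, and the estimate $\hat{x}_p$ is the $k$-th smallest of $n$ i.i.d. samples. Under those assumptions, the $k$-th smallest sample is at least $x_p$ exactly when at most $k - 1$ samples fall below $x_p$, and the number of such samples is binomial with parameters $n$ and $p$. The function names are hypothetical, and the paper's own definition of $\hat{x}_p$ may differ.

```python
import math

def binom_cdf(k, n, p):
    """F_Binom(k; n, p) = P(Binom(n, p) <= k), for 0 < p < 1.

    Terms are computed in log space so that large n does not overflow.
    """
    if k < 0:
        return 0.0
    k = min(k, n)
    log_p, log_q = math.log(p), math.log(1.0 - p)
    total = 0.0
    for i in range(k + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)
    return min(total, 1.0)

def prob_upper_bound(n, k, p):
    """P(k-th smallest of n samples >= x_p), assuming X is continuous.

    Assumption for this sketch: the estimate is the k-th order statistic.
    """
    return binom_cdf(k - 1, n, p)

# Example: the maximum of 100 samples as an estimate of the 99th percentile.
confidence = prob_upper_bound(n=100, k=100, p=0.99)
```

With these assumed settings the result equals $1 - 0.99^{100}$, roughly 0.63, which shows why many more samples than $1/(1-p)$ are needed before the empirical value is a reliable upper bound on a high percentile.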

Figures (10)

Figure 1: A histogram of the KL divergence for the IOI task. The x-axis shows the KL divergence between the model's output and the circuit's output on an input-corrupted-input pair, and the y-axis shows the number of input-corrupted-input pairs from our random sample of 1 million points that fall into each bin. There are 100 bins of equal size between the values of 0 and the maximum KL divergence achieved. Summary statistics of the plotted distribution are displayed in Table \ref{table:percentiles}.
Figure 2: A histogram of the KL divergence for the greater-than task.
Figure 3: A histogram of the KL divergence for the docstring task.
Figure 4: A histogram of the KL divergence for the IOI task, where all input-corrupted-input pairs are matched in the same way as in the original dataset, i.e., with the same location and object in the corrupted input as in the clean input. The x-axis shows the KL divergence between the model's output and the circuit's output on an input-corrupted-input pair, and the y-axis shows the number of input-corrupted-input pairs from our random sample of 1 million points that fall into each bin. There are 100 bins of equal size between the values of 0 and the maximum KL divergence achieved. Summary statistics of the plotted distribution are displayed in Table \ref{table:percentiles}.
Figure 5: A histogram of the KL divergence for the greater-than task, where all input-corrupted-input pairs are matched in the same way as in the original dataset, i.e., with the same event and first two digits in the corrupted input as in the clean input.
...and 5 more figures
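The binning described in the histogram captions (equal-width bins between 0 and the maximum observed KL divergence) can be sketched as follows. The function name and the toy values are hypothetical, and this is not the authors' plotting code.

```python
def histogram_counts(values, num_bins=100):
    """Count values in equal-width bins spanning [0, max(values)]."""
    upper = max(values)
    counts = [0] * num_bins
    if upper == 0:
        # Degenerate case: every value is 0, so all fall into the first bin.
        counts[0] = len(values)
        return counts
    for v in values:
        # The maximum value is placed in the last bin instead of overflowing.
        idx = min(int(v / upper * num_bins), num_bins - 1)
        counts[idx] += 1
    return counts

# Toy KL divergences standing in for the sampled input-corrupted-input pairs.
toy_kl = [0.0, 0.1, 0.2, 0.5, 4.0]
counts = histogram_counts(toy_kl, num_bins=4)
```

A long right tail, like the single large value in the toy data, leaves most bins empty and most of the mass in the first bin, which is the shape that makes the worst-case inputs easy to overlook in average-case metrics.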

Theorems & Definitions (4)

Proposition 2.1
Corollary 2.2
Proof of Proposition 2.1
Proof of Corollary 2.2
