Adversarial Circuit Evaluation
Niels uit de Bos, Adrià Garriga-Alonso
TL;DR
This work asks whether circuits from the literature faithfully reflect the behavior of the full model, and answers by adversarially evaluating their worst-case performance. It formalizes a procedure based on resample ablation for computing the KL divergence between the output of the full model and that of a circuit, and derives sample-size bounds for high-percentile guarantees. Applying the method to the IOI, docstring, and greater-than circuits, the authors show that the IOI and docstring circuits can deviate substantially from the full model even on benign inputs, while the greater-than circuit is more robust. The findings highlight the need for more robust, safety-oriented circuit design and suggest integrating adversarial evaluation into circuit discovery to improve both average and tail performance under distributional shifts.
Abstract
Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
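The evaluation described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes the circuit's logits have already been obtained by resample ablation (that step is not shown), that the divergence is taken in the direction KL(model || circuit), and that NumPy is available. The function names `kl_per_input` and `worst_inputs` are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_per_input(model_logits, circuit_logits):
    # One KL(model || circuit) value per input (row).
    # The direction of the divergence is an assumption of this sketch.
    log_p = log_softmax(model_logits)
    log_q = log_softmax(circuit_logits)
    return (np.exp(log_p) * (log_p - log_q)).sum(axis=-1)

def worst_inputs(kl, k):
    # Indices of the k inputs with the largest divergence, worst first.
    return np.argsort(kl)[::-1][:k]

# Toy logits: 3 inputs, vocabulary of 3 tokens (made-up numbers).
model_logits = np.array([[2.0, 0.0, 0.0],
                         [0.0, 1.0, 0.0],
                         [1.0, 1.0, 1.0]])
# Stand-in for logits from a resample-ablated circuit.
circuit_logits = np.array([[2.0, 0.0, 0.0],
                           [3.0, 0.0, 0.0],
                           [1.0, 1.0, 1.2]])

kl = kl_per_input(model_logits, circuit_logits)
print("per-input KL:", kl)
print("worst input index:", worst_inputs(kl, 1)[0])
```

Sorting by per-input divergence, rather than reporting only the mean, is what surfaces the tail cases that an average-case metric would hide.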
