Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning
Lin Zhang, Lijie Hu, Di Wang
TL;DR
The paper addresses how transformer models carry out multi-step reasoning by introducing SICAF, a mechanistic interpretability framework that combines automatic circuit discovery with layerwise self-influence analysis. Using three circuit-finding methods (EAP, EAP-IG, EAP-IG-KL) on a GPT-2 model fine-tuned on the IOI task, SICAF identifies small, faithful circuits and then computes the self-influence $I_H(x, x)$ across layers to map the model's reasoning path. The authors report a hierarchical, human-interpretable reasoning structure, with key entities and actions identified in early and final layers, and show that EAP-IG and EAP-IG-KL yield more balanced and robust reasoning traces than vanilla EAP. The approach offers a scalable, interpretable view of transformer reasoning that could inform the design of more reliable and transparent reasoning systems in NLP.
Abstract
Transformer-based language models have achieved significant success; however, their internal mechanisms remain largely opaque because of their non-linear interactions and high-dimensional operations. Previous studies have shown that these models implicitly embed reasoning trees, but humans can complete the same task using several distinct logical reasoning mechanisms, and it is still unclear which multi-step reasoning mechanisms language models use to solve such tasks. In this paper, we address this question by investigating the mechanistic interpretability of language models in the context of multi-step reasoning tasks. Specifically, we employ circuit analysis and self-influence functions to evaluate how the importance of each token changes throughout the reasoning process, which allows us to map the reasoning paths adopted by the model. We apply this methodology to the GPT-2 model on the Indirect Object Identification (IOI) prediction task and demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
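
To make the self-influence quantity concrete, the following is a minimal sketch, not the authors' implementation. It assumes the common influence-function form $I_H(x, x) = g^\top (H + \lambda I)^{-1} g$, where $g$ is the gradient of the loss on example $x$ and $H$ is the Hessian of the mean training loss. The paper's exact definition, sign convention, and layerwise restriction are not given in this section. A small logistic-regression model stands in for GPT-2, and all function names and the damping term $\lambda$ are hypothetical choices made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(w, x, y):
    # Gradient of the logistic loss on a single example (x, y) w.r.t. w.
    return (sigmoid(x @ w) - y) * x

def hessian_mean_loss(w, X):
    # Hessian of the mean logistic loss over all rows of X.
    p = sigmoid(X @ w)
    return (X * (p * (1.0 - p))[:, None]).T @ X / len(X)

def self_influence(w, x, y, X, damping=1e-3):
    # Assumed form: I(x, x) = g^T (H + damping * I)^{-1} g.
    # The damping term keeps the linear system well conditioned.
    g = grad_loss(w, x, y)
    H = hessian_mean_loss(w, X) + damping * np.eye(len(w))
    return float(g @ np.linalg.solve(H, g))

# Toy data standing in for a real model and dataset (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w = np.array([1.0, -2.0, 0.5])
y = (sigmoid(X @ w) > 0.5).astype(float)

scores = np.array([self_influence(w, X[i], y[i], X) for i in range(len(X))])
```

Under this form the score is non-negative, because the damped Hessian is positive definite, and it grows for examples whose loss gradient is large along poorly constrained parameter directions. A layerwise variant would presumably restrict $g$ and $H$ to the parameters of one layer at a time, which is an assumption about how the analysis is organised rather than a description of SICAF itself.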
