Mechanistic Unveiling of Transformer Circuits: Self-Influence as a Key to Model Reasoning

Lin Zhang, Lijie Hu, Di Wang

TL;DR

The paper tackles the challenge of understanding multi-step reasoning in transformer models by introducing SICAF, a mechanistic interpretability framework that combines automatic circuit discovery with layerwise self-influence analysis. Through automatic circuit-finding methods (EAP, EAP-IG, EAP-IG-KL) applied to a GPT-2 model fine-tuned on the IOI task, SICAF identifies small, faithful circuits and then computes self-influence $I_H(x, x)$ across layers to map the model's reasoning path. The authors demonstrate a hierarchical, human-interpretable reasoning structure, with key entities and actions identified in early and final layers, and show that EAP-IG and EAP-IG-KL yield more balanced and robust reasoning traces than vanilla EAP. The approach offers a scalable, interpretable window into transformer reasoning that could inform design choices for more reliable, transparent reasoning systems in NLP.

Abstract

Transformer-based language models have achieved significant success; however, their internal mechanisms remain largely opaque due to the complexity of non-linear interactions and high-dimensional operations. While previous studies have demonstrated that these models implicitly embed reasoning trees, humans typically employ various distinct logical reasoning mechanisms to complete the same task. It is still unclear which multi-step reasoning mechanisms are used by language models to solve such tasks. In this paper, we aim to address this question by investigating the mechanistic interpretability of language models, particularly in the context of multi-step reasoning tasks. Specifically, we employ circuit analysis and self-influence functions to evaluate the changing importance of each token throughout the reasoning process, allowing us to map the reasoning paths adopted by the model. We apply this methodology to the GPT-2 model on a prediction task (IOI) and demonstrate that the underlying circuits reveal a human-interpretable reasoning process used by the model.
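To make the term "self-influence" concrete, here is a minimal sketch. The abstract does not give the paper's exact formula, so this assumes the common influence-function form, in which the self-influence of an example $x$ is $\nabla_\theta L(x)^\top H^{-1} \nabla_\theta L(x)$, with $H$ a damped Hessian of the training loss. The toy logistic-regression model, the data, the damping value, and all function names are illustrative assumptions and are not taken from the paper. A layerwise variant would presumably restrict the gradient and Hessian to the parameters of a single layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_loss(theta, x, y):
    """Gradient of the logistic loss at one example (x, y), with y in {0, 1}."""
    p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(p - y) * xi for xi in x]

def hessian(theta, data, damping=0.01):
    """Damped Hessian of the mean logistic loss over `data` (assumed damping)."""
    n = len(theta)
    H = [[damping if i == j else 0.0 for j in range(n)] for i in range(n)]
    for x, _ in data:
        p = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        w = p * (1.0 - p) / len(data)
        for i in range(n):
            for j in range(n):
                H[i][j] += w * x[i] * x[j]
    return H

def solve_2x2(H, g):
    """Solve H v = g for a 2x2 system using Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(g[0] * H[1][1] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]

def self_influence(theta, x, y, data):
    """Assumed form: g^T H^{-1} g, where g is the loss gradient at (x, y)."""
    g = grad_loss(theta, x, y)
    v = solve_2x2(hessian(theta, data), g)
    return sum(gi * vi for gi, vi in zip(g, v))

# Hypothetical toy data: (features with a bias term, binary label).
data = [([1.0, 0.5], 1), ([1.0, -1.5], 0), ([1.0, 2.0], 1), ([1.0, -0.3], 0)]
theta = [0.1, 0.8]  # hypothetical fitted parameters
scores = [self_influence(theta, x, y, data) for x, y in data]
```

Because the damped Hessian is positive definite, each score is non-negative, and a larger score means the example's own gradient has more leverage on the parameters under this assumed definition.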

Paper Structure

This paper contains 25 sections, 12 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: (a) A simplified illustration of circuits within the model. (b) An example of how a language model tackles a reasoning task, such as the Indirect Object Identification (IOI) puzzle. The model identifies key entities, actions, and pronouns to deduce the recipient. The reasoning process involves steps like extracting names, determining actions, and linking pronouns to objects to reach the answer "Amy."
  • Figure 2: Comparison of normalized faithfulness, number of nodes, and parameter percentage for circuits identified by EAP, EAP-IG, and EAP-IG-KL on the IOI task. The x-axis represents the number of edges included, and each panel shows different metrics: normalized faithfulness (left), number of nodes (middle), and parameter percentage (right).
  • Figure 3: Heatmap of node importance across layers for EAP, EAP-IG, and EAP-IG-KL methods. The x-axis shows the number of edges included, and the y-axis shows the layers. Darker colors represent higher node importance, with EAP focusing on the last layers and EAP-IG and EAP-IG-KL showing more balanced distributions across layers.
  • Figure 4: Self-influence scores of key tokens across model layers for the EAP, EAP-IG, and EAP-IG-KL methods on the IOI task. Each subplot represents the distribution of self-influence for individual tokens across the 12 layers of the GPT-2 model. EAP shows concentrated influence in the early and final layers, while EAP-IG and EAP-IG-KL display more balanced self-influence across layers, reflecting a structured progression of token importance. Key tokens such as "Christina," "Amy," and "gave" consistently show high self-influence, demonstrating their significance in the reasoning process.
  • Figure 5: Heatmap of node importance across layers for EAP, EAP-IG, and EAP-IG-KL methods. The x-axis represents the layer indices, and the y-axis shows the tokens. Darker colors indicate higher node importance, with EAP focusing on the last layers and EAP-IG and EAP-IG-KL exhibiting a more balanced distribution across layers.