Knowledge Circuits in Pretrained Transformers

Yunzhi Yao; Ningyu Zhang; Zekun Xi; Mengru Wang; Ziwen Xu; Shumin Deng; Huajun Chen

TL;DR

This work introduces Knowledge Circuits, a circuit-theoretic view of Transformer computation that treats knowledge as emergent from subgraphs spanning MLPs, attention heads, and embeddings. By causally ablating edges in the computation graph, the authors construct task-specific circuits that predict target entities from subject-relation prompts, and analyze how information flows through mover and relation heads. They show that compact circuits can retain most of the model’s knowledge, reveal mechanisms for knowledge editing, and help interpret phenomena such as factual hallucinations and in-context learning. The results suggest that knowledge circuits offer a concrete framework for understanding and improving how Transformers store, edit, and utilize knowledge, with practical implications for reducing hallucinations and guiding safer, more reliable editing strategies.
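The edge-ablation idea summarized above can be illustrated with a minimal sketch on a toy graph. It is not the authors' implementation: the node names, edge weights, zero-ablation choice, score function, and threshold `tau` are all hypothetical and chosen only to show the pruning loop.

```python
from typing import Callable, Dict, List, Set, Tuple

Edge = Tuple[str, str]

# Hypothetical toy graph: edge weights stand in for how strongly one
# component's output feeds the next. These are not values from the paper.
WEIGHTS: Dict[Edge, float] = {
    ("embed", "mover_head"): 1.0,
    ("embed", "relation_head"): 1.0,
    ("embed", "noise_head"): 1.0,
    ("mover_head", "logit"): 2.0,
    ("relation_head", "logit"): 1.0,
    ("noise_head", "logit"): 0.01,
}
ORDER: List[str] = ["embed", "mover_head", "relation_head", "noise_head", "logit"]


def run_graph(active: Set[Edge]) -> Dict[str, float]:
    """Evaluate nodes in topological order; an ablated edge contributes zero."""
    values: Dict[str, float] = {}
    for node in ORDER:
        total = 1.0 if node == "embed" else 0.0  # toy input signal
        for (src, dst), weight in WEIGHTS.items():
            if dst == node and (src, dst) in active:
                total += weight * values[src]
        values[node] = total
    return values


def score(active: Set[Edge]) -> float:
    """Toy task metric: the value reaching the target-entity logit."""
    return run_graph(active)["logit"]


def discover_circuit(all_edges: Set[Edge],
                     metric: Callable[[Set[Edge]], float],
                     tau: float) -> Set[Edge]:
    """Greedily drop every edge whose ablation lowers the metric by less than tau."""
    kept = set(all_edges)
    for edge in sorted(all_edges):  # fixed order keeps the result deterministic
        baseline = metric(kept)
        candidate = kept - {edge}
        if baseline - metric(candidate) < tau:
            kept = candidate  # the edge is not needed for this behavior
    return kept


circuit = discover_circuit(set(WEIGHTS), score, tau=0.1)
for src, dst in sorted(circuit):
    print(f"{src} -> {dst}")
```

In this toy setting the edges through `noise_head` are pruned because removing them barely changes the score, while the edges through the hypothetical mover and relation heads survive. A real run would replace `score` with a metric computed on the model's output for the target entity.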

Abstract

The remarkable capabilities of modern large language models are rooted in their vast repositories of knowledge encoded within their parameters, enabling them to perceive the world and engage in reasoning. The inner workings of how these models store knowledge have long been a subject of intense interest and investigation among researchers. To date, most studies have concentrated on isolated components within these models, such as Multilayer Perceptrons (MLPs) and attention heads. In this paper, we delve into the computation graph of the language model to uncover the knowledge circuits that are instrumental in articulating specific knowledge. The experiments, conducted with GPT2 and TinyLLAMA, have allowed us to observe how certain information heads, relation heads, and Multilayer Perceptrons collaboratively encode knowledge within the model. Moreover, we evaluate the impact of current knowledge editing techniques on these knowledge circuits, providing deeper insights into the functioning and constraints of these editing methodologies. Finally, we utilize knowledge circuits to analyze and interpret language model behaviors such as hallucinations and in-context learning. We believe that knowledge circuits hold potential for advancing our understanding of Transformers and guiding the improved design of knowledge editing. Code and data are available at https://github.com/zjunlp/KnowledgeCircuits.

Paper Structure (49 sections, 5 equations, 13 figures, 6 tables)

Introduction
Background: Circuit Theory
Preliminaries
Circuit Discovery
Knowledge Circuits Discovery in Transformers
Knowledge Circuits Construction
Knowledge Circuits Information Analysis
Knowledge Circuits Experimental Settings
Implementations.
Metrics.
Dataset.
Knowledge Circuits Unveil Implicit Neural Knowledge Representations
Knowledge Circuits Evaluation.
Special Components in Knowledge Circuits.
A Running Example of Knowledge Circuit.
...and 34 more sections

Figures (13)

Figure 1: Knowledge circuit obtained from "The official language of France is French" in GPT2-Medium. Left: a simplified circuit; the whole circuit is shown in Figure \ref{fig:French_circuit} in the Appendix. We use $\dashrightarrow$ to skip some complex connections between nodes. Here, L15H0 means the first attention head in the 15th layer and MLP12 means the multilayer perceptron in the 13th layer. Right: the behavior of several special heads. The matrix on the left is the attention pattern of each attention head, and the heatmap on the right shows the head's output logits mapped to the vocabulary space.
Figure 2: The activated circuit component distributions in Layers in GPT2-Medium.
Figure 3: The rank and probability of the target entity $o$ at both the last subject token and the last token position when unembedding the intermediate layer's output for the fact "The official language of France is French".
Figure 4: Different behaviors when we edit the language model. In the original model, the mover head L15H3 moves the original token "Controller" and other information, whereas after ROME editing the mover head selects the correct information "Intel", indicating that ROME successfully added "Intel" to the model. FT layer-0 editing directly writes the edited knowledge into the edited component. However, both editing methods also affect the unrelated input "Windows server is created by?"
Figure 5: Left: a factual hallucination case, "The official currency of Malaysia is called", in which the mover head at layer 15 selects incorrect information. Right: an in-context learning case, in which some new heads that focus on the demonstration appear in the knowledge circuit.
...and 8 more figures
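The quantity plotted in Figure 3, the rank and probability of the target entity $o$ after unembedding an intermediate layer's output, can be sketched as follows. This is only an illustration under stated assumptions: the unembedding matrix `W_U`, the per-layer hidden states, and the target token id are toy values, and any final layer normalization a real model applies before unembedding is omitted.

```python
import numpy as np


def target_rank_and_prob(hidden: np.ndarray, W_U: np.ndarray, target_id: int):
    """Project a hidden state to vocabulary space and locate the target token.

    Returns (rank, probability), where rank 0 means the target is the top prediction.
    """
    logits = hidden @ W_U                 # shape: (vocab_size,)
    logits = logits - logits.max()        # stabilize the softmax numerically
    probs = np.exp(logits) / np.exp(logits).sum()
    rank = int((probs > probs[target_id]).sum())
    return rank, float(probs[target_id])


# Toy setup (hypothetical): 3-dimensional hidden states, 3-token vocabulary.
W_U = np.eye(3)
target_id = 2
layer_outputs = [
    np.array([2.0, 1.0, 0.0]),   # early layer: target is ranked last
    np.array([1.0, 0.0, 0.5]),   # middle layer: target moves up
    np.array([0.0, 0.0, 3.0]),   # late layer: target is the top token
]

ranks, probs = [], []
for layer, hidden in enumerate(layer_outputs):
    rank, prob = target_rank_and_prob(hidden, W_U, target_id)
    ranks.append(rank)
    probs.append(prob)
    print(f"layer {layer}: rank={rank}, prob={prob:.3f}")
```

Tracking these two numbers across layers, at both the last subject token and the last token position, gives the kind of curve the caption describes.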
