Mechanistic Interpretability of GPT-like Models on Summarization Tasks
Anurag Mishra
TL;DR
This work proposes a mechanistic interpretability framework for summarization in GPT-like models. It quantifies internal transformations with three metrics, the KL divergence $\mathrm{KL}(P \,\|\, Q)$, the attention entropy $H(A)$, and the per-layer activation magnitude $\mathrm{ActMag}(l)$, and instruments a pipeline to locate summarization circuits. It identifies a middle-layer circuit spanning layers 2, 3, and 5 and demonstrates that targeted LoRA on this circuit yields faster convergence, a 75% reduction in trainable parameters, and superior ROUGE scores compared with standard LoRA. The approach bridges black-box evaluation and mechanistic understanding, showing that summarization relies on reorganizing representation geometry in middle layers. This framework enables efficient, scalable analysis of adaptation mechanisms across model scales and architectures, and points toward causal validation of internal pathways for information selection and compression.
Abstract
Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or general text-generation tasks rather than on summarization specifically. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct a differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying the specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that the middle layers (particularly layers 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of the identified circuit achieves a significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
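The differential analysis described above can be illustrated with a minimal sketch. It assumes the standard Shannon entropy for $H(A)$ and the standard KL divergence, applied row-wise to attention matrices of shape (layers, heads, query, key). The function names, the array layout, the direction of the KL comparison, and the toy data are all illustrative assumptions and are not taken from the paper's pipeline.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(attn, eps=1e-12):
    """Shannon entropy (in nats) of each attention row.

    attn has shape (..., query, key); each row sums to 1 over the key axis.
    """
    a = np.clip(attn, eps, 1.0)  # avoid log(0)
    return -(a * np.log(a)).sum(axis=-1)

def kl_divergence(p, q, eps=1e-12):
    """Row-wise KL(p || q) between two attention distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return (p * (np.log(p) - np.log(q))).sum(axis=-1)

def fraction_heads_with_lower_entropy(attn_pre, attn_ft):
    """Fraction of (layer, head) pairs whose mean entropy dropped after fine-tuning.

    Both inputs are assumed to have shape (layers, heads, query, key).
    """
    h_pre = attention_entropy(attn_pre).mean(axis=-1)  # (layers, heads)
    h_ft = attention_entropy(attn_ft).mean(axis=-1)
    return float((h_ft < h_pre).mean())

# Toy stand-in data (assumption): the "fine-tuned" attention is a sharpened
# copy of the "pre-trained" attention, so every head becomes more focused.
rng = np.random.default_rng(0)
logits = rng.normal(size=(6, 4, 8, 8))   # 6 layers, 4 heads, 8 tokens
attn_pre = softmax(logits)
attn_ft = softmax(2.0 * logits)          # lower temperature -> lower entropy

frac = fraction_heads_with_lower_entropy(attn_pre, attn_ft)
# Mean KL(fine-tuned || pre-trained) per layer; the direction is an assumption.
per_layer_kl = kl_divergence(attn_ft, attn_pre).mean(axis=(1, 2))
print(frac, per_layer_kl.shape)
```

In practice the two attention tensors would be extracted from the pre-trained and fine-tuned checkpoints on the same inputs. Ranking layers by `per_layer_kl` is one plausible way to shortlist candidate circuit layers, though the paper's own selection criterion may differ.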
