Mechanistic Interpretability of GPT-like Models on Summarization Tasks

Anurag Mishra

TL;DR

This work proposes a mechanistic interpretability framework for summarization in GPT-like models. It quantifies the internal changes between pre-trained and fine-tuned models with three metrics, the KL divergence $KL(P||Q)$ between attention distributions, the attention entropy $H(A)$, and the layer-wise activation magnitude $ActMag(l)$, and it instruments a pipeline to locate summarization circuits. The analysis identifies a middle-layer circuit spanning layers 2, 3, and 5, and shows that applying LoRA only to this circuit yields faster convergence, a 75% reduction in trainable parameters, and higher ROUGE scores than standard LoRA. The approach bridges black-box evaluation and mechanistic understanding, showing that summarization relies on reorganizing representation geometry in the middle layers. The framework supports efficient, scalable analysis of adaptation mechanisms across model scales and architectures, and points toward causal validation of the internal pathways used for information selection and compression.
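One way to see where a 75% reduction in trainable parameters could come from is a simple count. The sketch below is a rough illustration, not the paper's configuration: it assumes a 12-layer model with hidden size 768 (GPT-2 small), LoRA rank 8, one adapted square weight matrix per layer, and a "standard LoRA" baseline that adapts every layer. The function name `lora_param_count` and all of these numbers are hypothetical. Under those assumptions, adapting 3 of 12 layers leaves one quarter of the adapter parameters.

```python
def lora_param_count(num_layers, d_model, rank, matrices_per_layer=1):
    """Count LoRA adapter parameters under a simplified model.

    Each adapted d_model x d_model weight gets two low-rank factors,
    A (rank x d_model) and B (d_model x rank), i.e. 2 * d_model * rank
    trainable parameters.
    """
    return num_layers * matrices_per_layer * 2 * d_model * rank


# Assumed values (not taken from the paper): GPT-2 small shape, rank 8.
NUM_LAYERS = 12
D_MODEL = 768
RANK = 8
CIRCUIT_LAYERS = [2, 3, 5]  # layers named in the TL;DR

full_params = lora_param_count(NUM_LAYERS, D_MODEL, RANK)
targeted_params = lora_param_count(len(CIRCUIT_LAYERS), D_MODEL, RANK)

# Fraction of adapter parameters saved by adapting only the circuit layers.
reduction = 1 - targeted_params / full_params
print(f"standard: {full_params}, targeted: {targeted_params}, "
      f"reduction: {reduction:.0%}")
```

The ratio depends only on the number of adapted layers, so it holds for any rank or hidden size as long as every layer receives the same adapters.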

Abstract

Mechanistic interpretability research seeks to reveal the inner workings of large language models, yet most work focuses on classification or generative tasks rather than summarization. This paper presents an interpretability framework for analyzing how GPT-like models adapt to summarization tasks. We conduct differential analysis between pre-trained and fine-tuned models, quantifying changes in attention patterns and internal activations. By identifying specific layers and attention heads that undergo significant transformation, we locate the "summarization circuit" within the model architecture. Our findings reveal that middle layers (particularly 2, 3, and 5) exhibit the most dramatic changes, with 62% of attention heads showing decreased entropy, indicating a shift toward focused information selection. We demonstrate that targeted LoRA adaptation of these identified circuits achieves significant performance improvement over standard LoRA fine-tuning while requiring fewer training epochs. This work bridges the gap between black-box evaluation and mechanistic understanding, providing insights into how neural networks perform information selection and compression during summarization.
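The differential analysis described above can be sketched with two standard quantities: the Shannon entropy of each attention row and the KL divergence between fine-tuned and pre-trained attention rows, each averaged per head. The code below is a minimal sketch under stated assumptions, not the paper's pipeline. The helper names (`attention_entropy`, `attention_kl`), the choice of $P$ as the fine-tuned distribution and $Q$ as the pre-trained one, the averaging over query positions, and the random toy attention maps (unmasked, with the "fine-tuned" maps made artificially sharper) are all hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy per head.

    attn has shape (..., query, key) with rows summing to 1.
    Entropy is taken over keys, then averaged over query positions.
    """
    row_entropy = -(attn * np.log(attn + eps)).sum(axis=-1)
    return row_entropy.mean(axis=-1)


def attention_kl(p, q, eps=1e-12):
    """Mean KL(P || Q) per head, with the same shape convention."""
    row_kl = (p * (np.log(p + eps) - np.log(q + eps))).sum(axis=-1)
    return row_kl.mean(axis=-1)


# Toy stand-ins for real attention maps: (layers, heads, query, key).
rng = np.random.default_rng(0)
logits = rng.normal(size=(12, 12, 16, 16))
attn_pre = softmax(logits)        # assumed "pre-trained" attention
attn_ft = softmax(2.0 * logits)   # assumed "fine-tuned", sharper by design

# Negative values mean the head became more focused after fine-tuning.
entropy_diff = attention_entropy(attn_ft) - attention_entropy(attn_pre)
kl = attention_kl(attn_ft, attn_pre)

frac_decreased = (entropy_diff < 0).mean()
most_changed_layer = int(kl.mean(axis=1).argmax())
print(f"heads with decreased entropy: {frac_decreased:.0%}")
print(f"layer with largest mean KL: {most_changed_layer}")
```

With real models, `attn_pre` and `attn_ft` would be replaced by attention weights collected from the two checkpoints on the same inputs. The resulting `(layers, heads)` arrays correspond to the kind of heatmaps described in Figures 2 and 3.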

Paper Structure

This paper contains 13 sections, 6 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Mechanistic Interpretability Framework: The model processes input documents through layer-wise transformations for analysis of internal mechanisms during summarization.
  • Figure 2: KL Divergence Heatmap illustrating differences between pre-trained and fine-tuned GPT-2 attention distributions.
  • Figure 3: Attention Entropy difference (fine-tuned minus pre-trained) heatmap highlighting increased (positive) or decreased (negative) attention focus.
  • Figure 4: KL Divergence comparison across model layers between different adaptation strategies: Base vs. Fine-tuned (blue), Fine-tuned vs. LoRA (red), and Base vs. LoRA (yellow).
  • Figure 5: Neuron-level activation changes in Layer 5 comparing pre-training and post-fine-tuning states. The substantial differences for specific neurons (particularly neuron 304) reveal specialized adaptation for summarization tasks.
  • ...and 1 more figure