Analyzing Memorization in Large Language Models through the Lens of Model Attribution

Tarun Ram Menta; Susmit Agrawal; Chirag Agarwal

TL;DR

The paper addresses memorization in large language models by viewing it through an architectural lens and using attention short-circuiting to isolate the role of individual transformer blocks. It introduces a principled method to bypass attention while preserving other components, supported by theoretical bounds on output deviations and extensive experiments on Pythia and GPT-Neo models. Theoretical results suggest deeper attention blocks mainly drive memorization, while earlier blocks are critical for generalization and reasoning, a claim corroborated by empirical findings that late-block short-circuiting reduces memorization with minimal performance loss. Practically, this work offers a feasible mitigation strategy for safer LLM deployment and motivates future architectural adjustments to balance memorization and generalization without sacrificing capability.

Abstract

Large Language Models (LLMs) are prevalent in modern applications but often memorize training data, leading to privacy breaches and copyright issues. Existing research has mainly focused on post-hoc analyses, such as extracting memorized content or developing memorization metrics, without exploring the underlying architectural factors that contribute to memorization. In this work, we investigate memorization through an architectural lens by analyzing how attention modules at different layers affect a model's memorization and generalization performance. Using attribution techniques, we systematically intervene in the LLM architecture by bypassing attention modules at specific blocks while keeping other components, such as layer normalization and MLP transformations, intact. We provide theorems that analyze our intervention mechanism mathematically, bounding the difference in layer outputs with and without our attributions. Our theoretical and empirical analyses reveal that attention modules in deeper transformer blocks are primarily responsible for memorization, whereas earlier blocks are crucial for the model's generalization and reasoning capabilities. We validate our findings through comprehensive experiments on two LLM families (Pythia and GPT-Neo) and five benchmark datasets. Our insights offer a practical approach to mitigating memorization in LLMs while preserving their performance, contributing to safer and more ethical deployment in real-world applications.
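The bypass described in the abstract can be illustrated with a minimal sketch. It assumes one plausible reading of "short-circuiting", suggested by the theorem titles ("Identity Attention"): the attention weight matrix is replaced by the identity, so each token attends only to itself. The single-head layout and the names `attention`, `short_circuit`, `w_q`, `w_k`, and `w_v` are hypothetical choices for illustration, not the authors' implementation. Layer normalization and the MLP, which the abstract says stay intact, would wrap this function unchanged.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(x, w_q, w_k, w_v, short_circuit=False):
    """Single-head causal self-attention over x (n tokens, d dims each)."""
    n, d = len(x), len(x[0])
    v = matmul(x, w_v)
    if short_circuit:
        # Assumed form of the intervention: identity attention weights,
        # so token i receives only its own value vector.
        weights = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    else:
        q, k = matmul(x, w_q), matmul(x, w_k)
        scores = matmul(q, list(zip(*k)))  # q @ k^T
        weights = []
        for i in range(n):
            # Causal mask: token i only sees positions 0..i.
            row = softmax([scores[i][j] / math.sqrt(d) for j in range(i + 1)])
            weights.append(row + [0.0] * (n - i - 1))
    return matmul(weights, v)


rng = random.Random(0)


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


x = rand_matrix(4, 3)  # 4 tokens, 3 dimensions
w_q, w_k, w_v = rand_matrix(3, 3), rand_matrix(3, 3), rand_matrix(3, 3)

standard = attention(x, w_q, w_k, w_v)
bypassed = attention(x, w_q, w_k, w_v, short_circuit=True)
```

Under this reading, the bypassed output for each token is just its own value projection, so it no longer depends on any earlier token in the context. The first token is unaffected either way, because under a causal mask it can only attend to itself.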

Paper Structure (24 sections, 4 theorems, 56 equations, 22 figures)

Introduction
Related Work
Methodology
Preliminaries
Attention Short-Circuiting
Theoretical Analysis
Experiments
Experimental Setup
Results
Memorization vs. Performance
Short-Circuiting across Model Scale
Qualitative Analysis
Reasoning vs. Non-reasoning Tasks
Conclusion
Limitations and Future Directions
...and 9 more sections

Key Result

Theorem 1

The normed difference between the output vectors obtained with standard attention and with short-circuited attention for a single transformer block is upper bounded by an expression in $M$, $\mathbf{W}$, and $\epsilon$, where $M = \max_{i \ne l} \| x_n - x_i \|$, $\mathbf{W}$ is the weight matrix of the FFN layer, and $\epsilon$ is the error in the linear approximation of the activation function.
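To see why a quantity like $M$ arises naturally, note that attention weights form a convex combination, so the attended value for the last token can differ from that token's own value by at most the largest pairwise value difference. The sketch below checks this elementary inequality numerically. It is not the bound of Theorem 1, which also involves the FFN weight matrix and the activation approximation error. The value projection `w_v` and the random attention weights are assumptions made for illustration.

```python
import math
import random


def matvec(w, x):
    """Apply matrix w (list of rows) to vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in v))


def sub(a, b):
    return [p - q for p, q in zip(a, b)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(1)
d, n = 4, 6
xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]   # token inputs
w_v = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(d)]  # hypothetical value projection
values = [matvec(w_v, x) for x in xs]

# Any attention distribution for the last token: non-negative, sums to 1.
weights = softmax([rng.gauss(0.0, 1.0) for _ in range(n)])

attn_out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d)]
identity_out = values[-1]  # identity attention keeps only the token's own value

deviation = norm(sub(attn_out, identity_out))

# M: largest distance between the last token's input and any other token's input.
M = max(norm(sub(xs[-1], x)) for x in xs[:-1])

# The Frobenius norm upper-bounds the spectral norm, so it gives a valid
# (if loose) Lipschitz constant for the linear map w_v.
frob = math.sqrt(sum(a * a for row in w_v for a in row))

print(f"deviation = {deviation:.4f}, bound = {frob * M:.4f}")
```

The inequality `deviation <= frob * M` holds for every choice of attention weights, because the deviation is a weighted average of the vectors `w_v @ (x_i - x_n)`, each of which has norm at most `frob * M`.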

Figures (22)

Figure 1: Memorization vs. Downstream Performance for GPT-Neo-1.3B with short-circuiting applied to the attention mechanism of each block independently. When applied to later blocks of the model, attention short-circuiting consistently yields lower memorization scores while maintaining downstream benchmark scores. Results on all models and datasets are in the Appendix.
Figure 2: Completion Entropy Scores for Memorized and Non-Memorized Samples. The model used is GPT-Neo-1.3B. Memorized samples consistently show significantly lower completion entropy scores.
Figure 3: Results for short-circuiting the attention mechanism in each block of Pythia models across four scales: 1.4B, 2.8B, 6.9B, and 12B. Short-circuiting in later blocks of the LLM consistently shows lower memorization with a negligible drop in downstream benchmark performance on PIQA across all model scales. Additional results on other benchmarks can be found in the Appendix.
Figure 4: Samples generated using greedy and temperature sampling by GPT-Neo-1.3B when short-circuiting attention at different blocks. Attention short-circuiting in earlier blocks leads to model collapse, while applying it to later blocks still results in coherent generations. The query prompt is taken from the GPT-2 paper (Radford et al., 2019).
Figure 5: Wikitext word perplexity scores for GPT-Neo models with attention short-circuiting applied to each block. Short-circuiting attention in later blocks results in minimal degradation in perplexity scores. Dashed lines indicate vanilla model performance.
...and 17 more figures

Theorems & Definitions (6)

Theorem 1: Bounding the Difference Between Identity Attention and Standard Attention for a Transformer with a single block
Theorem 2: Impact of Replacing Attention in any two consecutive Transformer Layers
Theorem 1: Bounding the Difference Between Identity Attention and Standard Attention for a Transformer with a single block
proof
Theorem 2: Impact of Replacing Attention in any two consecutive Transformer Layers
proof