Detecting Memorization in Large Language Models

Eduardo Slonski

TL;DR

This work tackles memorization in large language models by shifting from output-based signals to internal neuron activations. It introduces a mechanism-focused pipeline that identifies activations separating memorized from non-memorized tokens, and trains probes that achieve near-perfect detection accuracy, enabling interventions that suppress memorization without harming general performance. The approach extends to repetition and uncovers a certainty-related activation, offering interpretable insights into how internal mechanisms manifest. Practically, this enables scalable token labeling, improved evaluation integrity, and targeted control of model behavior for safer, more reliable AI systems.

Abstract

Large language models (LLMs) have achieved impressive results in natural language processing but are prone to memorizing portions of their training data, which can compromise evaluation metrics, raise privacy concerns, and limit generalization. Traditional methods for detecting memorization rely on output probabilities or loss functions and often lack precision because of confounding factors such as common language patterns. In this paper, we introduce an analytical method that precisely detects memorization by examining neuron activations within the LLM. By identifying specific activation patterns that differentiate memorized from non-memorized tokens, we train classification probes that achieve near-perfect accuracy. As demonstrated in this study, the approach also applies to other mechanisms, such as repetition. Intervening on these activations allows us to suppress memorization without degrading overall performance, which strengthens evaluation integrity by ensuring that metrics reflect genuine generalization. Our method also supports large-scale labeling of tokens and sequences, which can improve the training efficiency and results of next-generation AI models. Our findings contribute to model interpretability and offer practical tools for analyzing and controlling internal mechanisms in LLMs.
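
The probing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the synthetic activations, the choice of a logistic-regression probe trained by plain gradient descent, and all function names and parameters (`make_synthetic_activations`, `train_probe`, `probe_accuracy`, `shift`, `lr`, `steps`) are assumptions made for illustration only.

```python
import numpy as np

def make_synthetic_activations(n_tokens=400, n_neurons=16, shift=3.0, seed=0):
    """Hypothetical stand-in for real LLM activations (an assumption):
    the first two neurons are shifted for 'memorized' tokens, the rest are noise."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_tokens)   # 1 = memorized, 0 = not memorized
    acts = rng.normal(size=(n_tokens, n_neurons))
    acts[:, :2] += shift * labels[:, None]
    return acts, labels

def train_probe(acts, labels, lr=0.1, steps=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = np.zeros(acts.shape[1])
    b = 0.0
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(acts @ w + b)))   # sigmoid of the logits
        err = probs - labels                            # gradient of the log loss w.r.t. logits
        w -= lr * (acts.T @ err) / len(labels)
        b -= lr * err.mean()
    return w, b

def probe_accuracy(acts, labels, w, b):
    """Fraction of tokens whose predicted class matches the label."""
    preds = (acts @ w + b > 0).astype(int)
    return float((preds == labels).mean())

if __name__ == "__main__":
    train_x, train_y = make_synthetic_activations(seed=0)
    test_x, test_y = make_synthetic_activations(seed=1)
    w, b = train_probe(train_x, train_y)
    print("held-out probe accuracy:", probe_accuracy(test_x, test_y, w, b))
```

With real models, the activation matrix would come from a forward pass over labeled memorized and non-memorized tokens rather than from a random generator.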

Paper Structure

This paper contains 27 sections, 3 equations, 33 figures, 4 tables.

Figures (33)

  • Figure 1: Visualization of our detection results. The detection ranges from red (not memorized) to blue (memorized). The detection is extremely precise: it completely separates the memorized sequences from the not memorized ones, even when the sequences are randomly arranged together.
  • Figure 2: Distribution of Cohen's $d$ Values for Output Activations Separating Memorized vs. Not Memorized Tokens. We highlight the activations with a Cohen's $d$ above 1 and indicate their proportion among all activations.
  • Figure 3: Distribution of Cohen's $d$ Values for MLP Activations Separating Memorized vs. Not Memorized Tokens. The MLP activations separate the two groups far more effectively than other activation types, both in the number of separating neurons and in their proportion among all neurons.
  • Figure 4: Activation Values for Neuron 6181 in MLP Layer 10. The Memorized (blue) and Not Memorized (orange) groups are clearly separated.
  • Figure 5: Classification Accuracy Using Best Activation on the MLP per layer.
  • ...and 28 more figures
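
The figure captions above refer to Cohen's $d$ per activation and to classifying tokens with a single best activation. A minimal sketch of both computations follows. It assumes the standard pooled-standard-deviation form of Cohen's $d$ and a simple midpoint threshold between group means; whether the paper uses exactly these variants is not stated here, and the function names and synthetic data are hypothetical.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d per column (neuron), using the pooled standard deviation.
    Inputs have shape (n_tokens, n_neurons)."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = group_a.var(axis=0, ddof=1)
    var_b = group_b.var(axis=0, ddof=1)
    pooled_sd = np.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (group_a.mean(axis=0) - group_b.mean(axis=0)) / pooled_sd

def midpoint_threshold_accuracy(values_a, values_b):
    """Accuracy of a one-neuron classifier that thresholds at the midpoint
    between the two group means (an assumed, deliberately simple rule)."""
    threshold = (values_a.mean() + values_b.mean()) / 2.0
    if values_a.mean() >= values_b.mean():
        correct = (values_a > threshold).sum() + (values_b <= threshold).sum()
    else:
        correct = (values_a <= threshold).sum() + (values_b > threshold).sum()
    return float(correct) / (len(values_a) + len(values_b))

if __name__ == "__main__":
    # Synthetic activations (an assumption): only neuron 3 separates the groups.
    rng = np.random.default_rng(0)
    not_memorized = rng.normal(size=(500, 8))
    memorized = rng.normal(size=(500, 8))
    memorized[:, 3] += 4.0

    d = cohens_d(memorized, not_memorized)
    strong = np.abs(d) > 1.0            # the threshold highlighted in Figure 2
    best = int(np.argmax(np.abs(d)))
    print("share of neurons with |d| > 1:", strong.mean())
    print("best neuron:", best)
    print("accuracy using best neuron:",
          midpoint_threshold_accuracy(memorized[:, best], not_memorized[:, best]))
```

Ranking neurons by $|d|$ and then thresholding the top one mirrors the per-layer "best activation" comparison in Figure 5, although the actual selection procedure may differ.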