A Lightweight Method to Disrupt Memorized Sequences in LLM

Parjanya Prajakta Prashant, Kaustubh Ponkshe, Babak Salimi

TL;DR

The paper tackles the challenge of memorized verbatim generation in large language models by introducing TokenSwap, a lightweight, post-hoc defense that operates only with token-level outputs. TokenSwap selectively replaces probabilities for a small, grammar-focused set of tokens with those from a compact auxiliary model, disrupting memorized generation without retraining or access to model weights. Across extreme memorization scenarios and production-grade models, TokenSwap achieves up to 10× reductions in memorization with negligible degradation in downstream tasks, and approaches the effectiveness of pre-training methods like Goldfish without requiring data access. The approach is practical for real-world deployments, preserving fluency and instruction-following while mitigating copyright and ethical risks associated with verbatim content leakage.

Abstract

As language models scale, their performance improves dramatically across a wide range of tasks, but so does their tendency to memorize and regurgitate parts of their training data verbatim. This tradeoff poses serious legal, ethical, and safety concerns, especially in real-world deployments. Existing mitigation techniques, such as differential privacy or model unlearning, often require retraining or access to internal weights, making them impractical for most users. In this work, we introduce TokenSwap, a lightweight, post-hoc defense designed for realistic settings where the user can only access token-level outputs. Our key insight is that while large models are necessary for high task performance, small models (e.g., DistilGPT-2) are often sufficient to assign fluent, grammatically plausible probabilities to common function words, and, crucially, they memorize far less. By selectively swapping token probabilities between models, TokenSwap preserves the capabilities of large models while reducing their propensity for verbatim reproduction. Evaluations on Pythia-6.9B and Llama-3-8B show up to a 10× drop in exact memorization with negligible task degradation. Our method offers a practical, accessible solution for mitigating memorized generation in deployed LLMs.
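The probability swap described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `token_swap`, the toy vocabulary, and the rescaling step (which keeps the total probability mass on the swapped tokens unchanged so the result still sums to 1) are assumptions made for illustration. The sketch also assumes both models share a vocabulary.

```python
from typing import Dict, Set


def token_swap(
    p_main: Dict[str, float],
    p_aux: Dict[str, float],
    swap_set: Set[str],
) -> Dict[str, float]:
    """Replace main-model probabilities on `swap_set` with auxiliary ones.

    Assumption of this sketch: auxiliary probabilities are rescaled so the
    swapped tokens keep the same total mass they had under the main model.
    """
    swapped = [t for t in p_main if t in swap_set]
    main_mass = sum(p_main[t] for t in swapped)
    aux_mass = sum(p_aux.get(t, 0.0) for t in swapped)
    if aux_mass == 0.0:
        # Nothing usable from the auxiliary model: fall back to the main model.
        return dict(p_main)
    scale = main_mass / aux_mass
    return {
        t: (p_aux.get(t, 0.0) * scale if t in swap_set else p)
        for t, p in p_main.items()
    }


# Toy next-token distributions (hypothetical numbers).
p_main = {"the": 0.90, "a": 0.02, "cat": 0.05, "dog": 0.03}  # large model, confident
p_aux = {"the": 0.30, "a": 0.30, "cat": 0.20, "dog": 0.20}   # small model, less peaked
mixed = token_swap(p_main, p_aux, swap_set={"the", "a"})
```

In this toy case the large model's near-certain choice of "the" is flattened to match the small model's relative preferences among function words, while the probabilities of the content words "cat" and "dog" are left exactly as the large model assigned them.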

Paper Structure

This paper contains 54 sections, 2 equations, 4 figures, 10 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of TokenSwap. Our approach replaces token probabilities of high-frequency "grammar-based" tokens with those from a small auxiliary language model. This mitigates memorized generation while maintaining fluency and model performance. The top path shows standard LLM generation, while the bottom path demonstrates how TokenSwap alters token selection to disrupt memorization and produce novel text.
  • Figure 2: Memorization (EMR) vs. Performance (CE Loss) across different model sizes. Larger, more capable models exhibit higher memorization. TokenSwap, with Pythia-70M as the auxiliary model, achieves low memorization rates while maintaining competitive performance. Details in Section \ref{subsection:wild} and Section \ref{section:discussion}.
  • Figure 3: Comparison of text generation methods. Red text indicates memorized content. Standard generation reproduces the entire suffix verbatim, while TokenSwap generates novel content.
  • Figure 4: We compare TokenSwap with Goldfish \cite{hans2024like} on ROUGE-L score distributions for Wikipedia generations \cite{bridge2001wikipedia}. The similar distributions of TokenSwap and Goldfish (k=3) demonstrate that our inference-time approach is comparable to expensive pre-training methods in reducing memorization.
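
The EMR axis in Figure 2 is not defined in this summary. As a rough illustration only, the sketch below assumes it denotes the fraction of generations whose leading tokens reproduce the reference suffix exactly. The function name `exact_match_rate` and the token-ID inputs are hypothetical, and the paper's own metric may differ in detail.

```python
from typing import Sequence


def exact_match_rate(
    generations: Sequence[Sequence[int]],
    references: Sequence[Sequence[int]],
) -> float:
    """Fraction of generations that begin with the full reference suffix.

    Assumed reading of "exact memorization"; not taken from the paper.
    """
    if len(generations) != len(references):
        raise ValueError("generations and references must have equal length")
    if not generations:
        return 0.0
    hits = sum(
        1
        for gen, ref in zip(generations, references)
        if list(gen[: len(ref)]) == list(ref)
    )
    return hits / len(generations)


# Toy token-ID sequences (hypothetical): the first generation copies its
# reference verbatim, the second diverges at the last token.
rate = exact_match_rate([[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 7]])
```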

Theorems & Definitions (1)

  • Definition 1: Extractable Memorization