
Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI Models

Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete

TL;DR

Frontier AI models trained on large web-scale data pose privacy and security risks due to memorization and potential data leakage. The paper analyzes memorization dynamics using a new metric, kl-LD, and a z-complexity proxy, showing that memorization probability scales with the number of repeats and the simplicity of sequences, and that memories can be latent yet retrievable. It demonstrates that memorization is largely stationary after initial exposure, and that weight perturbations can uncover latent memories, motivating a cross-entropy-based diagnostic to detect hidden leakage. Overall, the work highlights practical privacy risks in frontier models and provides diagnostic tools and mechanistic hypotheses to mitigate leakage, while noting the need for broader validation across models and longer training timelines.

Abstract

Frontier AI systems are making transformative impacts across society, but such benefits are not without costs: models trained on web-scale datasets containing personal and private data raise profound concerns about data privacy and security. Language models are trained on extensive corpora including potentially sensitive or proprietary information, and the risk of data leakage - where the model response reveals pieces of such information - remains inadequately understood. Prior work has investigated what drives memorization and has identified sequence complexity and the number of repetitions as key factors. Here, we focus on the evolution of memorization over training. We begin by reproducing findings that the probability of memorizing a sequence scales logarithmically with the number of times it is present in the data. We next show that sequences which are apparently not memorized after the first encounter can be "uncovered" throughout the course of training even without subsequent encounters, a phenomenon we term "latent memorization". Latent memorization presents a challenge for data privacy, as memorized sequences may be hidden at the final checkpoint of the model but remain easily recoverable. To this end, we develop a diagnostic test relying on the cross entropy loss to uncover latent memorized sequences with high accuracy.
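As a rough illustration of the kind of cross-entropy diagnostic the abstract mentions, the sketch below flags sequences whose mean per-token cross entropy falls below a threshold. The function names, the input format (per-token probabilities the model assigns to the true tokens), and the threshold value are all assumptions made for illustration; the paper's actual test is not specified in this summary.

```python
import math


def mean_cross_entropy(token_probs):
    """Mean negative log-likelihood (nats per token) of the true tokens.

    token_probs: probabilities the model assigned to each true token
    (hypothetical input format, not taken from the paper).
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def flag_latent_candidates(sequences, threshold):
    """Return ids of sequences whose mean cross entropy is below threshold.

    sequences: dict mapping a sequence id to its list of true-token
    probabilities. threshold: an illustrative cutoff chosen by the user.
    """
    return [
        seq_id
        for seq_id, probs in sequences.items()
        if mean_cross_entropy(probs) < threshold
    ]
```

A low mean cross entropy on a sequence that is not reproduced verbatim would, under this reading, mark it as a candidate for latent memorization that merits closer inspection.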

Paper Structure (19 sections, 10 figures, 1 table)

Figures (10)

  • Figure 1: Data statistics and the probability of memorization. a. Average kl-LD as a function of the number of times the sequence is repeated in the dataset, for Pythia-1b and Amber-7b. b. Average kl-LD as a function of the Z-complexity of the sequence. c. Relationship between kl-LD and repeats for different complexity levels. d. Comparison of the predictions of the best linear model predicting the kl-LD from the logarithm of the sample complexity and number of repeats.
  • Figure 2: Memorization status is stationary. a. Histograms of changes in edit distance between consecutive checkpoints for sequences which were encountered once during training. Notably, the change in kl-LD is symmetric between consecutive checkpoints. This is surprising since the model appears to "forget" the sequence during one timestep but recover it later on. b. Distribution of kl-LD at checkpoints 10k and 11k. Color is the log of the number of sequences in each bin. The vast majority of sequences are not memorized in either checkpoint. c. Visualization of individual samples and the change in the memorized length during training. d. Grey lines are subsampled single-sequence trajectories throughout training. Each sequence was normalized such that the distribution of memorization lengths had mean 0 and variance 1. The red line denotes the mean and the shaded area denotes the region within two standard deviations of the kl-LD of all sequences at a single point in time. Notably, the distribution is the same at every checkpoint. This contrasts with both the exponential decay expected of models which experience catastrophic forgetting and the linear growth of variance expected of processes exhibiting random walk behavior.
  • Figure 3: a. Distribution of the best kl-LD achievable by perturbing the model weights. Data points were selected such that they were un-memorized (kl-LD $> 50$) at 10k but were memorized (kl-LD $< 10$) at some point during the next 10k training steps. The top panel is the histogram for perturbations of the 19k checkpoint and the bottom panel is for the 10k checkpoint. Notably, the perturbations cause the 10k model distances to match the distribution of the 19k model, while perturbing the 19k model does not have a significant effect. This indicates that model training mimics random noise with respect to the memorization status of the sequences. b. Comparison of using perturbations to evoke a target sequence for three different classes of sequences. The top panel shows sequences which are "latent" memorized, the middle panel shows sequences which weren't memorized during training, and the bottom panel shows sequences which appear later in training and had not yet been encountered by the model. We note that perturbing the weights is only able to evoke sequences which are "latent" memorized. c. Comparison of the cross entropy losses of the three classes of sequences analyzed in b. The cross entropy losses of "latent" memorized sequences are much lower. d. Drawing of a mechanistic proposal for how memorization is stabilized during training. e. Visualization of the Levenshtein distances from the target for various perturbations. Each row is a single sequence, and the height of each bar corresponds to the number of perturbations which resulted in a Levenshtein distance in the corresponding bin.
  • Figure 4: Histogram of the repeats vs the edit distance. Hue is log density.
  • Figure 5: Histogram of the repeats vs the edit distance, split by complexity. Hue is log density.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Definition 2.1: kl-LD distance
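
The exact statement of Definition 2.1 is not reproduced in this summary. The sketch below assumes one plausible reading, consistent with the figure captions' references to edit distance and Levenshtein distance: prompt the model with the first k tokens of a sequence, generate l tokens, and take the Levenshtein distance between the generated continuation and the true one. The `generate` callable and the roles of `k` and `l` are hypothetical stand-ins, not the paper's definition.

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance between two sequences, via dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def kl_ld(sequence, generate, k, l):
    """Assumed kl-LD: edit distance between l generated and l true tokens.

    sequence: list of token ids. generate(prompt, n): hypothetical callable
    returning n continuation tokens for the given prompt.
    """
    prompt = sequence[:k]
    target = sequence[k:k + l]
    continuation = generate(prompt, l)
    return levenshtein(continuation[:l], target)
```

Under this reading, a value near 0 means the continuation is reproduced almost verbatim, which matches the captions' use of small kl-LD values to mark memorized sequences.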