Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Frontier AI Models
Sunny Duan, Mikail Khona, Abhiram Iyer, Rylan Schaeffer, Ila R Fiete
TL;DR
Frontier AI models trained on large web-scale data pose privacy and security risks due to memorization and potential data leakage. The paper analyzes memorization dynamics using a new metric, kl-LD, and a z-complexity proxy, showing that memorization probability scales with the number of repeats and the simplicity of sequences, and that memories can be latent yet retrievable. It demonstrates that memorization is largely stationary after initial exposure, and that weight perturbations can uncover latent memories, motivating a cross-entropy–based diagnostic to detect hidden leakage. Overall, the work highlights practical privacy risks in frontier models and provides diagnostic tools and mechanistic hypotheses to mitigate leakage, while noting the need for broader validation across models and longer training timelines.
Abstract
Frontier AI systems are making transformative impacts across society, but such benefits are not without costs: models trained on web-scale datasets containing personal and private data raise profound concerns about data privacy and security. Language models are trained on extensive corpora that include potentially sensitive or proprietary information, and the risk of data leakage, in which a model's response reveals pieces of such information, remains inadequately understood. Prior work has investigated which factors drive memorization and has identified sequence complexity and the number of repetitions as the main drivers. Here, we focus on the evolution of memorization over training. We begin by reproducing findings that the probability of memorizing a sequence scales logarithmically with the number of times it is present in the data. We next show that sequences which are apparently not memorized after the first encounter can be "uncovered" throughout the course of training even without subsequent encounters, a phenomenon we term "latent memorization". Latent memorization presents a challenge for data privacy, as memorized sequences may be hidden at the final checkpoint of the model but remain easily recoverable. To this end, we develop a diagnostic test relying on the cross-entropy loss to uncover latent memorized sequences with high accuracy.
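As a rough illustration of what a cross-entropy-based diagnostic could look like, the sketch below computes the mean cross-entropy loss of each candidate sequence and flags those that fall below a cutoff. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function names, the threshold value, and the per-token probabilities are all hypothetical, and in practice the probabilities would come from a model's output distribution over the true next tokens.

```python
import math


def sequence_cross_entropy(target_token_probs):
    """Mean negative log-likelihood (in nats) of the true tokens.

    `target_token_probs` holds the probability the model assigned to each
    true token of the sequence (hypothetical input; must all be > 0).
    """
    if not target_token_probs:
        raise ValueError("need at least one token probability")
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)


def flag_candidates(probs_by_sequence, threshold):
    """Return ids of sequences whose mean cross entropy is below `threshold`.

    A low loss on a sequence is treated here as a sign that it may be
    memorized; the threshold is an assumed, illustrative cutoff.
    """
    return [
        seq_id
        for seq_id, probs in probs_by_sequence.items()
        if sequence_cross_entropy(probs) < threshold
    ]


# Made-up probabilities for two sequences, for illustration only.
example = {
    "seq_a": [0.9, 0.8, 0.95],  # true tokens are highly likely under the model
    "seq_b": [0.1, 0.05, 0.2],  # true tokens are unlikely under the model
}
print(flag_candidates(example, threshold=1.0))
```

With these made-up inputs only `seq_a` falls below the cutoff. Any real threshold would need to be calibrated against sequences known not to be memorized.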
