Unveiling Memorization in Code Models
Zhou Yang, Zhipeng Zhao, Chenyu Wang, Jieke Shi, Dongsun Kim, DongGyun Han, David Lo
TL;DR
The paper addresses the memorization of training data in large code generation models, demonstrating that verbatim snippets can be extracted and categorized into a taxonomy of 3 categories and 14 subcategories. Using CodeParrot and CodeParrot-small, the authors generate 20,000 outputs per generation strategy and identify over 40,125 memorized snippets via Type-1 clone detection, revealing how prompts and model size influence memorization. They show a strong link between training-data frequency and memorized outputs, propose four memorization-prediction metrics (notably PPL-zlib and PPL-PPL ratio), and validate their findings on the deployed models InCoder and StarCoder. The study offers practical mitigation guidance (data deduplication, rights declarations, provenance) and highlights the real-world risks in deployed systems, motivating preventative measures and future work on larger languages and robust defenses.
Abstract
The availability of large-scale datasets, advanced architectures, and powerful computational resources has led to effective code models that automate diverse software engineering activities. The datasets usually consist of billions of lines of code from both open-source and private repositories. A code model can memorize and reproduce source code verbatim, and that code may contain vulnerabilities, sensitive information, or code under strict licenses, leading to potential security and privacy issues. This paper investigates an important problem: to what extent do code models memorize their training data? We conduct an empirical study to explore memorization in large pre-trained code models. Our study highlights that simply extracting 20,000 outputs (each having 512 tokens) from a code model can produce over 40,125 code snippets that are memorized from the training data. To provide a better understanding, we build a taxonomy of memorized contents with 3 categories and 14 subcategories. The results show that the prompts sent to the code models affect the distribution of memorized contents. We identify several key factors of memorization. Specifically, given the same architecture, larger models suffer more from memorization problems. A code model produces more memorized content when it is allowed to generate longer outputs. We also find a strong positive correlation between how often an output occurs in the training data and how often it appears in the generated outputs, which indicates that a potential way to reduce memorization is to remove duplicates in the training data. We then identify effective metrics that accurately infer whether an output contains memorized content. We also make suggestions to deal with memorization.
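To make the idea of verbatim (Type-1) memorization concrete, here is a minimal Python sketch; it is not the authors' pipeline. It treats a snippet as memorized if a window of consecutive, whitespace-normalized lines from a generated output also appears in a training file, and it counts how many training files contain that window. The function names, the 6-line default window, and the whitespace-only normalization are all assumptions made for illustration. The last helper shows one plausible reading of a perplexity-to-zlib ratio: the log-perplexity is supplied by the caller, and the exact form is an assumption rather than the paper's confirmed definition.

```python
import zlib
from collections import Counter


def normalize(line):
    # Assumption: clones are compared after collapsing whitespace only.
    return " ".join(line.split())


def line_windows(code, window=6):
    # Slide a fixed-size window over the non-empty, normalized lines.
    lines = [normalize(l) for l in code.splitlines()]
    lines = [l for l in lines if l]
    return ["\n".join(lines[i:i + window])
            for i in range(len(lines) - window + 1)]


def find_memorized(output, training_files, window=6):
    # Map each window to the number of training files that contain it.
    index = Counter()
    for source in training_files:
        index.update(set(line_windows(source, window)))
    # Keep the output windows that occur verbatim in the training data.
    return {w: index[w] for w in line_windows(output, window) if w in index}


def zlib_bits(text):
    # Size of the zlib-compressed text in bits, a crude entropy estimate.
    return 8 * len(zlib.compress(text.encode("utf-8")))


def ppl_zlib_ratio(log_perplexity, text):
    # Hypothetical ratio: model log-perplexity relative to zlib size.
    # The caller must obtain log_perplexity from the code model.
    return log_perplexity / zlib_bits(text)
```

Under these assumptions, a window that matches many training files is the kind of duplicated content that the abstract suggests removing, and an unusually low ratio marks text the model finds far more predictable than a generic compressor does.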
