Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting

Lizhe Zhang; Wentao Chen; Li Zhong; Letian Peng; Zilong Wang; Jingbo Shang

TL;DR

The paper tackles the problem of distinguishing memorization from genuine generalization in LLM-based code generation. It introduces a code rewriting pipeline to create semantically altered tasks and defines the Memorization Risk Index (MRI) as the product of two quantities: the similarity of the model's answer to the original solution, and the relative accuracy drop after rewriting. In experiments on MBPP+ and BigCodeBench, the authors show that memorization risk tends to shrink with model scale on easier tasks but persists on harder ones, that supervised fine-tuning increases accuracy at the cost of higher memorization, and that PPO offers a better accuracy–risk balance. These findings provide guidance for selecting fine-tuning strategies and highlight the need for evaluation metrics that separate harmful memorization from benign reuse.

Abstract

Large language models (LLMs) have recently demonstrated exceptional code generation capabilities. However, there is a growing debate over whether LLMs mostly memorize (i.e., replicate or reuse large parts of their training data) or generalize (i.e., go beyond their training data). Existing evaluations largely proxy memorization with surface or structural similarity, thereby conflating benign reuse of repeated code with harmful recall and neglecting task correctness under semantic variation. We define harmful memorization behaviorally as failure at high similarity and introduce code rewriting, a semantic perturbation that rewrites the answer to a given coding task into a semantically different answer at a similar difficulty level, then reverse-engineers a novel coding task from it. We further propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart. MRI is high only when both conditions hold, that is, when the model outputs similar code but fails the perturbed task, thereby capturing harmful memorization rather than benign reuse of repeated code. Empirical evaluations on the code generation benchmarks MBPP+ and BigCodeBench reveal that (1) memorization does not increase with larger models and in many cases is alleviated as they scale; (2) supervised fine-tuning (SFT) improves accuracy while introducing memorization; (3) reinforcement learning with proximal policy optimization (PPO) achieves a more balanced trade-off between memorization and generalization.
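
To make the two MRI signals concrete, the following is a minimal sketch of a score built as a product of similarity and relative accuracy drop, as the summary above describes. It is not the authors' implementation: the function names are hypothetical, `difflib.SequenceMatcher` is only a stand-in for whatever similarity measure the paper uses, and clamping the drop to [0, 1] is an assumption made here so the score stays normalized.

```python
from difflib import SequenceMatcher


def similarity(generated: str, original: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: the paper's metric may differ)."""
    return SequenceMatcher(None, generated, original).ratio()


def relative_accuracy_drop(acc_original: float, acc_rewritten: float) -> float:
    """Fraction of original-task accuracy lost on the rewritten tasks."""
    if acc_original <= 0.0:
        # Assumption: no drop can be measured if nothing was solved originally.
        return 0.0
    drop = (acc_original - acc_rewritten) / acc_original
    # Assumption: clamp so that an accuracy gain counts as zero drop.
    return min(1.0, max(0.0, drop))


def memorization_risk_index(sim: float, acc_original: float, acc_rewritten: float) -> float:
    """Product of the two signals: high only when both are high."""
    return sim * relative_accuracy_drop(acc_original, acc_rewritten)


if __name__ == "__main__":
    # Similar output and a large drop: high risk.
    print(memorization_risk_index(0.9, 0.8, 0.4))
    # Similar output but no drop: benign reuse, zero risk.
    print(memorization_risk_index(0.9, 0.8, 0.8))
```

Because the score is a product, high similarity alone (benign reuse) or a performance drop alone (the model fails but does not echo the original solution) both leave it low, which matches the behavioral definition of harmful memorization given in the abstract.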
