Memorize or Generalize? Evaluating LLM Code Generation with Code Rewriting

Lizhe Zhang, Wentao Chen, Li Zhong, Letian Peng, Zilong Wang, Jingbo Shang

TL;DR

The paper tackles the problem of distinguishing memorization from genuine generalization in LLM-based code generation. It introduces a code rewriting pipeline to create semantically altered tasks and defines Memorization Risk Index (MRI) as the product of similarity to the original solution and the relative accuracy drop after rewriting. Through experiments on MBPP+ and BigCodeBench, the authors show that memorization risk tends to shrink with model scale on easier tasks but persists on harder ones, and that supervised fine-tuning increases accuracy at the cost of higher memorization, while PPO offers a better accuracy–risk balance. These findings provide guidance for selecting fine-tuning strategies and highlight the need for evaluation metrics that separate harmful memorization from benign reuse.

Abstract

Large language models (LLMs) have recently demonstrated exceptional code generation capabilities. However, there is a growing debate over whether LLMs are mostly memorizing (i.e., replicating or reusing large parts of their training data) or generalizing (i.e., going beyond their training data). Existing evaluations largely proxy memorization with surface or structural similarity, thereby conflating benign reuse of repeated code with harmful recall and neglecting task correctness under semantic variation. We define harmful memorization behaviorally as failure at high similarity and introduce code rewriting, a semantic perturbation that rewrites the answer to a given coding task into a semantically different answer at a similar difficulty level, then reverse-engineers a novel coding task from it. We further propose the Memorization Risk Index (MRI), a normalized score that combines two signals: (i) how similar the model's answer for the rewritten task is to the original ground-truth solution, and (ii) how much performance drops from the original task to its rewritten counterpart. MRI is high only when both conditions hold, that is, when the model outputs similar code but fails the perturbed task, thereby capturing harmful memorization rather than benign reuse of repeated code. Empirical evaluations on the code generation benchmarks MBPP+ and BigCodeBench reveal that (1) memorization does not increase with larger models and in many cases is alleviated as they scale; (2) supervised fine-tuning (SFT) improves accuracy while introducing memorization; (3) reinforcement learning with proximal policy optimization (PPO) achieves a more balanced trade-off between memorization and generalization.
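
The two signals that MRI combines can be illustrated with a small sketch. It follows the description of MRI as the product of a similarity score and a relative accuracy drop; the exact formulas are assumptions, not the paper's definitions. In particular, `similarity` uses Python's `difflib` as a stand-in for whatever similarity measure the authors use, and `relative_accuracy_drop` assumes the drop is measured relative to the original accuracy and clipped at zero.

```python
from difflib import SequenceMatcher


def similarity(generated: str, original_solution: str) -> float:
    """Stand-in similarity in [0, 1] (assumption: the paper's metric may differ)."""
    return SequenceMatcher(None, generated, original_solution).ratio()


def relative_accuracy_drop(acc_ori: float, acc_rew: float) -> float:
    """Assumed form: accuracy lost on rewritten tasks, relative to original accuracy.

    Clipped at 0 so that an accuracy gain does not produce a negative risk.
    """
    if acc_ori <= 0.0:
        return 0.0
    return max(0.0, acc_ori - acc_rew) / acc_ori


def memorization_risk_index(sim_rew: float, acc_ori: float, acc_rew: float) -> float:
    """Product of similarity and relative accuracy drop, as described above."""
    return sim_rew * relative_accuracy_drop(acc_ori, acc_rew)


if __name__ == "__main__":
    # High similarity combined with a large drop gives a high score.
    print(memorization_risk_index(sim_rew=0.9, acc_ori=0.8, acc_rew=0.4))
    # High similarity without a drop (benign reuse) gives zero.
    print(memorization_risk_index(sim_rew=0.9, acc_ori=0.8, acc_rew=0.8))
```

Because the score is a product, either a low similarity or a negligible accuracy drop keeps it small, which matches the stated goal of flagging only the case where the model reproduces similar code and fails the perturbed task.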

Paper Structure

This paper contains 59 sections, 8 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Our proposed code rewriting vs. popular semantically equivalent perturbations. $X$ denotes text and $C$ denotes code. Code rewriting, which creates semantically different tasks, first rewrites the original solution $C$ into a new code solution $C_\text{rew}$, then generates a new description $X_\text{rew}$ based on $C_\text{rew}$. A judge agent then accepts or rejects the rewritten task for quality assurance. Mutation and paraphrase, which create semantically equivalent tasks, are included for robustness evaluation as a comparison to memorization. All perturbations are performed by GPT-5, shown as the ChatGPT logo. Generation prompts are in Appendix \ref{appendix:prompts}.
  • Figure 2: Scaling trends in MRI across Qwen-2.5 Instruct vs. Coder on MBPP+ and BigCodeBench.
  • Figure 3: Effect of fine-tuning on Qwen-2.5-7B (base and Coder) on BigCodeBench. SFT raises $\text{Acc}(\mathcal{T}_{\text{ori}})$ but also increases $\text{Sim}(\mathcal{T}_{\text{rew}})$ and $\mathrm{RAD}_{\text{rew}}$, inflating MRI; PPO preserves or modestly improves accuracy while keeping $\mathrm{RAD}_{\text{rew}}$ low, yielding a better risk–accuracy trade-off. Checkpoints for SFT and PPO are selected following the rules in \ref{sec:finetuning}. Dataset statistics can be found in \ref{fine tune dataset details}.
  • Figure 4: Example for Task-99 from MBPP+: the solution generated by Qwen2.5-Coder-32B-Instruct PASSED on the original task but FAILED on the code_rewriting task.
  • Figure 5: Example for Task-224 from MBPP+: the solution generated by Qwen2.5-Coder-32B-Instruct PASSED on the original task but FAILED on the code_rewriting task.
  • ...and 8 more figures
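
The code rewriting procedure described in the Figure 1 caption (rewrite the solution, generate a description from the rewritten code, then let a judge accept or reject) can be sketched as a small control loop. This is a minimal illustration under stated assumptions: `rewrite_code`, `describe_code`, and `judge` are hypothetical callables standing in for the LLM calls, and retrying up to `max_attempts` times after a rejection is an assumption that the source does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Task:
    description: str  # X: the natural-language task text
    solution: str     # C: the reference code solution


def rewrite_task(
    task: Task,
    rewrite_code: Callable[[str], str],   # hypothetical: C -> C_rew
    describe_code: Callable[[str], str],  # hypothetical: C_rew -> X_rew
    judge: Callable[[Task, Task], bool],  # hypothetical: accept or reject
    max_attempts: int = 3,                # assumption: retry after rejection
) -> Optional[Task]:
    """Return an accepted rewritten task, or None if every attempt is rejected."""
    for _ in range(max_attempts):
        c_rew = rewrite_code(task.solution)   # step 1: rewrite the solution
        x_rew = describe_code(c_rew)          # step 2: derive a new description
        candidate = Task(description=x_rew, solution=c_rew)
        if judge(task, candidate):            # step 3: quality gate
            return candidate
    return None
```

Generating the description from the rewritten code, rather than rewriting the description directly, keeps the new task text and its reference solution consistent with each other by construction.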