Training Language Models on Synthetic Edit Sequences Improves Code Synthesis

Ulyana Piterbarg, Lerrel Pinto, Rob Fergus

TL;DR

This work reframes code synthesis as a sequential edit problem by introducing LintSeq, which uses a linter to generate insertion-only edits that progressively build programs. LintSeq converts existing code into insert-focused edit sequences, enabling autoregressive models to learn code by editing step-by-step, rather than generating entire programs in one pass. Across tiny and larger models, fine-tuning on LintSeq data yields improvements in pass@k and demonstrates more favorable scaling with test-time compute, while ablations confirm the critical role of the linter-guided edit structure. The approach is architecture- and tokenizer-agnostic, data-efficient for on-device models, and suggests broader potential for code understanding and efficient synthesis with synthetic, structured edits. Limitations include the current focus on insertions and Python, motivating future work on deletions, multi-language support, and larger-scale pretraining.

Abstract

Software engineers mainly write code by editing existing programs. In contrast, language models (LMs) autoregressively synthesize programs in a single pass. One explanation for this is the scarcity of sequential edit data. While high-quality instruction data for code synthesis is scarce, edit data for synthesis is even scarcer. To fill this gap, we develop a synthetic data generation algorithm called LintSeq. This algorithm refactors programs into sequences of synthetic edits by using a linter to procedurally sample across interdependent lines of source code. Synthetic edits sampled with LintSeq reflect the syntax and semantics of their programming language. To test the algorithm, we use it to refactor a dataset of instruction + program pairs into instruction + program-diff-sequence tuples. Then, we fine-tune a series of smaller LMs ranging from 2.6B to 14B parameters on both the refactored and original versions of this dataset. We perform comprehensive evaluations comparing edit sequence code LMs against baselines on HumanEval, MBPP(+), CodeContests, DS-1000, and BigCodeBench. We show that models fine-tuned to iteratively synthesize code match or outperform baselines on pass@1, and exhibit better scaling across higher pass@k as a function of total test-time FLOPs. Finally, we also pretrain our own tiny LMs for code understanding. We show that fine-tuning these models to synthesize code edit-by-edit results in strong performance on HumanEval and MBPP(+) compared to existing code language models of similar scale such as CodeT5+, AlphaCode, and Codex.
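
For readers unfamiliar with the pass@k metric used above, the sketch below computes it with the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn per problem and c is the number that pass the tests. Whether the paper computes pass@k in exactly this way is an assumption; the function name `pass_at_k` and the example counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples drawn for the problem
    c: number of those samples that pass all tests
    k: attempt budget being scored (must satisfy k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Probability that a random size-k subset of the n samples
    # contains no passing sample, subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts: 64 samples per problem, 8 of which pass.
for k in (1, 10, 50):
    print(f"pass@{k} = {pass_at_k(64, 8, k):.3f}")
```

A benchmark score is then the mean of this quantity over all problems. Because it is non-decreasing in k, comparing models at equal k can hide differences in per-sample cost, which is one reason to plot it against total test-time FLOPs instead.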

Paper Structure (49 sections, 2 equations, 11 figures, 13 tables)

Figures (11)

  • Figure 1: Code synthesis with LMs trained on synthetic code edit sequences. Left: An example generation from an LM trained to synthesize code as a stream of linter-error-free edits. Right: Training LMs to write code edit-by-edit by preprocessing instruction data for SFT with LintSeq improves test-time scaling laws during repeated sampling, i.e. the percentage of benchmark problems solved by any attempt (pass@k) as a function of total test-time FLOPs compared to training on standard data (see Appendix \ref{appendix:fig1_computation}). Shading indicates standard error in linear fit.
  • Figure 2: LintSeq: Training LMs to write code edit-by-edit with supervised learning by generating synthetic data. LintSeq decomposes existing programs into synthetic edits that reflect the syntax & semantics of their programming language. At each iteration, the algorithm samples an edit chunk from a program by: randomly selecting a line of code to delete; identifying the minimal set of lines that are dependent on this line with a code linter; and finally, removing the line and its dependents. These steps are repeated until all lines of code have been removed. LintSeq then processes the reversed sequence of program states with Unix-diff to express it as a sequence of edits.
  • Figure 3: HumanEval, MBPP(+), DS-1000, and BigCodeBench (Instruct) results for Gemma 2, Phi-3, and Llama 3.1 models after SFT on LintSeq (indigo) vs standard Python code (grey). On HumanEval and MBPP(+), we tune sampling temperature, top-p, and min-p over $\{1, 1.1, 1.2\}$, $\{0.95, 1.0\}$, and $\{0, 0.05\}$, respectively, with $n=64$ samples. On DS-1000, we evaluate models with the completion format, temperature $=0.2$, top-p $=0.5$, min-p $=0$, and $n=40$, following Wei et al. (2024) and Luo et al. (2023). On BigCodeBench Instruct, we evaluate with greedy decoding (Zhuo et al., 2024). Error bars on HumanEval and MBPP scores show standard error.
  • Figure 4: Repeatedly sampling from models fine-tuned (SFT) to generate edit sequences vs full programs: we compare the best pass@k score achieved by modulating sampling hyperparameters for LintSeqInstruct vs Instruct models. On HumanEval and MBPP(+), we use the same values as in Figure \ref{fig:diff_vs_raw_fine-tuning_agg}, while on CodeContests, we sweep over temperatures $\{0.5, 0.6\}$ and use top-p $=1.0$, min-p $=0$, and $n=128$. We then plot benchmark score as a function of the total cost of repeated sampling from each model in FLOPs (see Appendix \ref{appendix:fig1_computation}). Shading shows standard error in linear fit. See Figure \ref{fig:teaser} for Phi-3 3.8B and Llama 3.1 8B test-time scaling with repeated sampling curves on HumanEval and MBPP.
  • Figure 5: Left: HumanEval and MBPP(+) pass@1 achieved by fine-tuning TinyCodeLM models on linter-guided (LintSeq) vs randomly sampled (RandSeq) code edit sequences. We tune sampling parameters over the same values as in Figures \ref{fig:diff_vs_raw_fine-tuning_agg} and \ref{fig:score_with_sampling}, and report the best scores for each model. Right: Comparing total proportions of generations with lint errors. Error bars show standard error.
  • ...and 6 more figures
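
The procedure described in the Figure 2 caption (delete a random line, remove the lines that now fail the linter, repeat until the program is empty, then reverse the states and diff them) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: Python's built-in `compile()` stands in for a real code linter and catches syntax errors only (it will not flag, for example, a call to a function whose definition was removed), `difflib.unified_diff` stands in for Unix diff, and the names `lint_error_line`, `backward_states`, and `edit_sequence` are hypothetical.

```python
import difflib
import random


def lint_error_line(lines):
    """Stand-in linter (assumption): 1-based line of the first syntax error, or None."""
    try:
        compile("\n".join(lines) + "\n", "<sketch>", "exec")
        return None
    except SyntaxError as err:
        # Clamp, because the reported line can point past the end of the file.
        lineno = err.lineno or len(lines)
        return min(max(lineno, 1), len(lines))


def backward_states(program, rng):
    """Delete lines until the program is empty, recording each error-free state."""
    lines = program.splitlines()
    states = [list(lines)]
    while lines:
        # Randomly select a line of code to delete.
        del lines[rng.randrange(len(lines))]
        # Remove whatever the stand-in linter flags until the state is clean.
        bad = lint_error_line(lines)
        while bad is not None:
            del lines[bad - 1]
            bad = lint_error_line(lines)
        states.append(list(lines))
    return states


def edit_sequence(program, seed=0):
    """Reverse the deletion states and express consecutive pairs as diffs."""
    states = backward_states(program, random.Random(seed))[::-1]
    diffs = [
        "\n".join(difflib.unified_diff(before, after, lineterm="", n=0))
        for before, after in zip(states, states[1:])
    ]
    return states, diffs


example = "\n".join([
    "def add(a, b):",
    "    total = a + b",
    "    return total",
    "print(add(1, 2))",
])
states, diffs = edit_sequence(example, seed=0)
for step, diff in enumerate(diffs, start=1):
    print(f"--- edit {step} ---")
    print(diff)
```

Because every recorded state is obtained only by deleting lines from the previous one, the reversed sequence builds the program through insertions, and each intermediate program passes the stand-in check. A real linter would also track dependencies such as undefined names, which is what lets the sampled edits reflect semantics as well as syntax.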