Pull Requests as a Training Signal for Repo-Level Code Editing

Qinglin Zhu; Tianyu Chen; Shuai Lu; Lei Ji; Runcong Zhao; Murong Ma; Xiangxiang Dai; Yulan He; Lin Gui; Peng Cheng; Yeyun Gong

TL;DR

This work tackles the challenge of teaching models repository-level code editing by mining high-quality supervision from real pull requests. It introduces Clean-PR, a data-centric mid-training pipeline that filters noisy PRs, reconstructs deterministic Search/Replace edits, and augments context with linked issues, yielding a large verifiable corpus of 2 million instances across 12 languages. An agentless, stepwise SFT regimen with error-driven augmentation aligns the model with a localisation-navigation-editing workflow and boosts SWE-bench performance beyond agent-based systems and larger models. The results demonstrate that repository-level editing capabilities can be effectively encoded in model weights, reducing reliance on complex inference scaffolding while maintaining strong generalization and robustness. The work provides a scalable, reproducible data framework and practical insights for integrating repository-editing priors into future code-synthesis models.

Abstract

Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across a large codebase. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised via high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, resulting in the largest publicly available corpus of 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by an agentless-aligned supervised fine-tuning process with error-driven data augmentation. On SWE-bench, our model significantly outperforms the instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified. These results demonstrate that repository-level code understanding and editing capabilities can be effectively internalised into model weights under a simplified, agentless protocol, without relying on heavy inference-time scaffolding.
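The reconstruction-and-validation step described above can be illustrated with a minimal sketch. It assumes a Search/Replace block is a pair of strings, that a block is valid only if its search text matches the pre-change file exactly once, and that validation means the blocks applied in order reproduce the post-change file. The function names `apply_search_replace` and `round_trip_ok` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one Search/Replace block, requiring a single exact match."""
    if not search:
        raise ValueError("search text must be non-empty")
    count = source.count(search)
    if count != 1:
        # Zero matches: the block is stale. Several matches: the edit is ambiguous.
        raise ValueError(f"search text must match exactly once, found {count}")
    return source.replace(search, replace, 1)


def round_trip_ok(before: str, after: str, blocks: List[Tuple[str, str]]) -> bool:
    """Return True if applying the blocks in order to `before` yields `after`."""
    current = before
    try:
        for search, replace in blocks:
            current = apply_search_replace(current, search, replace)
    except ValueError:
        return False
    return current == after


before = "def add(a, b):\n    return a - b\n\ndef sub(a, b):\n    return a - b\n"
after = "def add(a, b):\n    return a + b\n\ndef sub(a, b):\n    return a - b\n"

# Too little context: the search text occurs twice, so the block is rejected.
ambiguous = [("    return a - b\n", "    return a + b\n")]
# Enough context to be unique: the block applies and reproduces `after`.
unique = [("def add(a, b):\n    return a - b\n", "def add(a, b):\n    return a + b\n")]
```

In this sketch, `round_trip_ok(before, after, ambiguous)` is false and `round_trip_ok(before, after, unique)` is true, which shows why a search span may need extra surrounding lines before it identifies one location.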

Paper Structure (63 sections, 4 figures, 20 tables, 1 algorithm)

Introduction
Data Construction
Clean-PR: Verified Mid-Training Data
Data Collection.
Data Filtering.
Search/Replace Format Reconstruction.
Issue-Augmented Intent.
From Clean-PR-full to Clean-PR-train.
Agentless-Aligned Stepwise SFT
Task Decomposition and Filtering.
Error-Driven Augmentation.
Experiments
Experiment Setup
Training Configurations.
Benchmarks and Metrics.
...and 48 more sections

Figures (4)

Figure 1: Overview of the Clean-PR Framework. (a) Data Construction: Raw GitHub PRs undergo a rigorous filtering pipeline (bot detection, core language enforcement) and intent augmentation via linked Issues. The valid diffs are then converted into minimal unique Search/Replace blocks, verified through round-trip patch application to ensure correctness. (b) Two-Stage Training Pipeline: The base model first undergoes Mid-Training on the verifiable Clean-PR corpus to encode repository-level editing priors. This is followed by an Agentless-Aligned Stepwise SFT, where the model is fine-tuned on decomposed tasks (Localisation $\rightarrow$ Navigation $\rightarrow$ Editing) with Error-Driven Augmentation to robustly handle distracting repository contexts.
Figure 2: Generalisation dynamics during mid-training.
Figure 3: Pass@k performance on SWE-bench Lite and Verified. We report the resolution rates of our model (Clean-PR, mid-trained on All Languages) as the number of samples $k$ scales.
Figure 4: The Life of a Data Point: From Raw Noise to Verified Signal. Track A (Left) illustrates the aggressive pruning of noise, rejecting inputs due to bot activity, unmerged status, non-core language files, or missing history. Track B (Right) depicts the transformation of a valid PR: it is augmented with the linked Issue context to recover user intent and converted into a deterministic Search/Replace block for verifiable training.
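
The rejection criteria named in the Figure 4 caption (bot activity, unmerged status, non-core language files, missing history) could be sketched as a simple filter. Everything below is an illustrative assumption, not the paper's implementation: the `PullRequest` record, its field names, the `CORE_EXTENSIONS` set, the "[bot]" login-suffix heuristic, and the rule that every changed file must be in a core language.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath
from typing import List, Optional

# Hypothetical subset of "core language" extensions; the paper's actual list is not given here.
CORE_EXTENSIONS = {".py", ".js", ".ts", ".java", ".go", ".rs"}


@dataclass
class PullRequest:
    """Hypothetical minimal PR record used only for this sketch."""
    author: str
    merged: bool
    changed_files: List[str] = field(default_factory=list)
    has_base_history: bool = True


def rejection_reason(pr: PullRequest) -> Optional[str]:
    """Return the first reason to reject the PR, or None if it passes all checks."""
    # Assumed heuristic: bot accounts are recognised by a "[bot]" login suffix.
    if pr.author.lower().endswith("[bot]"):
        return "bot"
    if not pr.merged:
        return "unmerged"
    if not pr.has_base_history:
        return "missing_history"
    # Assumed rule: every changed file must have a core-language extension.
    if not pr.changed_files or any(
        PurePosixPath(path).suffix not in CORE_EXTENSIONS for path in pr.changed_files
    ):
        return "non_core_language"
    return None


candidates = [
    PullRequest("dependabot[bot]", True, ["requirements.py"]),
    PullRequest("alice", False, ["src/app.py"]),
    PullRequest("alice", True, ["README.md"]),
    PullRequest("alice", True, ["src/app.py"]),
]
kept = [pr for pr in candidates if rejection_reason(pr) is None]
```

Returning a reason string rather than a boolean makes it easy to count how many PRs each rule removes, which is useful when tuning a pipeline of this kind.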
