Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models

Yuheng Tang, Hongwei Li, Kaijie Zhu, Michael Yang, Yangruibo Ding, Wenbo Guo

TL;DR

Co-PatcheR addresses software patching by decomposing the task into small, specialized reasoning components for localization, patch generation, and validation. It trains one model, Loc-Gen, for localization and generation, plus two validation models, Val-assert and Val-no-assert, that produce diverse PoCs and judge patch correctness; a majority vote finalizes the patch. With 3×14B models and ~6K training samples, it achieves a 46% resolved rate on SWE-bench-Verified and is more data- and compute-efficient than end-to-end baselines and larger models. Ablations validate the design choices, including two-step localization, critique-enabled generation, diverse PoC testing, and testing-phase scaling. The work suggests that small, component-specific models can rival larger systems in patching performance while reducing training and inference costs, with implications for more scalable, privacy-conscious code repair workflows.

Abstract

Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works have started to train specialized patching models. Most of them train a single model to handle the end-to-end patching pipeline (issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all of these tasks, as the sub-tasks have different workflows and require different expertise. As a result, even with a 70-billion-parameter model, state-of-the-art methods reach at most a 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature of the patching process, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technical novelties are the component-specific task designs and training recipes. First, we train one model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that uses two models to craft issue-reproducing test cases, with and without assertions, and to judge patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves a 46% resolved rate on SWE-bench-Verified with only 3 × 14B models. This makes Co-PatcheR the best patcher built on specialized models, while requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choices of training data size, model size, and testing-phase scaling strategy.
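
The final selection step (dynamic testing, then a majority vote when testing leaves a tie, as the Figure 1 caption puts it) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `Candidate` record, scoring a patch by the sum of passed PoC and functionality tests, and voting by exact patch-text equality.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one candidate patch and its test results."""
    patch: str        # patch text (assumed representation)
    poc_passed: int   # issue-reproducing PoC tests the patch passes
    func_passed: int  # existing functionality tests the patch passes


def select_patch(candidates: list[Candidate]) -> str:
    """Pick the patch with the best dynamic-testing score.

    Assumption: the score is simply poc_passed + func_passed. If several
    candidates share the top score, fall back to a majority vote over
    identical patch texts among the tied candidates.
    """
    if not candidates:
        raise ValueError("no candidate patches")

    def score(c: Candidate) -> int:
        return c.poc_passed + c.func_passed

    best = max(score(c) for c in candidates)
    tied = [c.patch for c in candidates if score(c) == best]

    # Majority vote: the patch text that occurs most often among the ties.
    return Counter(tied).most_common(1)[0][0]


if __name__ == "__main__":
    pool = [
        Candidate("patch-A", poc_passed=2, func_passed=3),
        Candidate("patch-B", poc_passed=3, func_passed=2),
        Candidate("patch-B", poc_passed=3, func_passed=2),
        Candidate("patch-C", poc_passed=1, func_passed=1),
    ]
    print(select_patch(pool))
```

In this toy pool, `patch-A` and both copies of `patch-B` tie on the test score, so the vote decides in favor of the patch that was generated twice. A real system would likely normalize patches before comparing them; that step is omitted here.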

Paper Structure

This paper contains 35 sections, 8 figures, 6 tables.

Figures (8)

  • Figure 1: The overall training recipes and inference pipeline of Co-PatcheR. We design one model for localization and generation, where each component has two steps. We further design two models for PoC generation, with and without assertions. During inference, we run PoC and functionality tests to select the final patch, and conduct a majority vote when dynamic testing results in a tie.
  • Figure 2: The top@$5$ file-level and line-level accuracy for localization.
  • Figure 3: The pass@$1$ resolved rate for generation.
  • Figure 4: The resolved rate for different validation models and validation workflow.
  • Figure 5: More ablation studies on the generation component.
  • ...and 3 more figures