Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models
Yuheng Tang, Hongwei Li, Kaijie Zhu, Michael Yang, Yangruibo Ding, Wenbo Guo
TL;DR
Co-PatcheR approaches software patching by decomposing the task into small, specialized reasoning models for localization, patch generation, and validation. It trains Loc-Gen for localization and generation, plus Val-assert and Val-no-assert to produce diverse issue-reproducing tests (PoCs) and judge patch correctness, and uses a majority vote to select the final patch. With 3 × 14B models and ~6K training samples, it achieves a 46% resolved rate on SWE-bench-Verified and is more data- and compute-efficient than end-to-end baselines and larger models. Extensive ablations validate the design choices, including two-step localization, critique-enabled generation, diverse PoC testing, and testing-phase scaling. The work suggests that small, component-specific models can rival larger systems in patching performance while reducing training and inference costs, with implications for more scalable, privacy-conscious code repair workflows.
Abstract
Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works have started to train specialized patching models. Most of them train a single model to handle the end-to-end patching pipeline (issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all of these tasks, as the sub-tasks have different workflows and require different expertise. As a result, even with a 70-billion-parameter model, state-of-the-art (SOTA) methods reach at most a 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature of software patching, we propose Co-PatcheR, the first collaborative patching system with small, specialized reasoning models for individual components. Our key technical novelties are the component-specific task designs and training recipes. First, we train one model for both localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation with critique. We then propose a hybrid patch validation that uses two models to craft issue-reproducing test cases, with and without assertions, and to judge patch correctness, followed by majority-vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves a 46% resolved rate on SWE-bench-Verified with only 3 × 14B models. This makes Co-PatcheR the best patcher built on specialized models, while requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choices of training data size, model size, and testing-phase scaling strategy.
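The validation-then-vote step can be illustrated with a minimal sketch. This is only one plausible reading of "majority vote-based patch selection", not the authors' implementation: the function name `select_patch`, the shape of the `verdicts` mapping, and the tie-breaking rule are all assumptions made for illustration. Candidates are first filtered to those that pass the most issue-reproducing tests, and then the patch text that appears most often among the survivors is chosen.

```python
from collections import Counter
from typing import Dict, List


def select_patch(candidates: List[str], verdicts: Dict[str, List[bool]]) -> str:
    """Hypothetical patch selection: filter by test verdicts, then majority vote.

    candidates: candidate patch texts (duplicates allowed, one per sample).
    verdicts:   maps a patch text to per-test judgments (True = judged correct).
    """
    if not candidates:
        raise ValueError("no candidate patches")

    # Score each distinct patch by how many tests judged it correct.
    scores = {p: sum(verdicts.get(p, [])) for p in candidates}
    best = max(scores.values())

    # Keep only the candidates that reach the best score.
    finalists = [p for p in candidates if scores[p] == best]

    # Majority vote: identical patch texts count as votes for the same fix.
    # Ties resolve to the patch encountered first (assumed tie-break).
    return Counter(finalists).most_common(1)[0][0]
```

For example, with candidates `["A", "B", "A", "C"]` where `A` and `B` pass both tests and `C` passes one, `C` is filtered out and `A` wins the vote because it was sampled twice.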
