Repository Structure-Aware Training Makes SLMs Better Issue Resolver
Zexiong Ma, Shengnan An, Zeqi Lin, Yanzhen Zou, Bing Xie
TL;DR
This work addresses the gap between small language models (SLMs) and large language models (LLMs) in repository-level problem solving through Repository Structure-Aware Training (ReSAT). ReSAT constructs two types of training data from a large set of open-source GitHub issues and PRs: localization data at file, function, and line granularity, and code-edit data. It then fine-tunes SLMs on this data so that they better understand repository structure and perform context-aware edits. Evaluations on SWE-Bench-verified and RepoQA show that ReSAT improves issue-resolving performance and long-context code understanding, and ablations confirm that the localization and code-edit data are complementary. The approach offers a practical path to competitive repository-level capabilities in open-source SLMs while reducing reliance on costly LLMs.
Abstract
Language models have been applied to various software development tasks, but their performance varies with model scale. Large Language Models (LLMs) outperform Small Language Models (SLMs) on complex tasks such as repository-level issue resolving, but they raise concerns about privacy and cost. In contrast, SLMs are more accessible but underperform on complex tasks. In this paper, we introduce ReSAT (Repository Structure-Aware Training), which constructs training data from a large number of issues and corresponding pull requests from open-source communities to enhance a model's understanding of repository structure and its issue-resolving ability. We construct two types of training data: (1) localization training data, which is multi-level and progressive and improves code understanding and localization capability; (2) code edit training data, which improves context-based code editing capability. The evaluation results on SWE-Bench-verified and RepoQA demonstrate that ReSAT effectively enhances SLMs' issue-resolving and repository-level long-context understanding capabilities.
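To make the two data types more concrete, here is a minimal sketch of how one issue and its pull request might be turned into training samples. The abstract does not specify the data schema or prompt format, so everything in it is an assumption for illustration: the `EditedSpan` record, the `build_localization_samples` and `build_edit_samples` helpers, and the prompt strings are hypothetical and are not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EditedSpan:
    """One region changed by a pull request (hypothetical schema)."""
    file: str
    function: str
    start_line: int
    end_line: int
    before: str  # code before the PR was applied
    after: str   # code after the PR was applied


def unique(items):
    """Deduplicate while preserving first-seen order."""
    return list(dict.fromkeys(items))


def build_localization_samples(issue, spans):
    """Build file -> function -> line samples, each conditioned on the coarser level."""
    files = unique(s.file for s in spans)
    functions = unique(f"{s.file}::{s.function}" for s in spans)
    lines = [f"{s.file}:{s.start_line}-{s.end_line}" for s in spans]
    return [
        {"level": "file", "prompt": issue, "target": files},
        {"level": "function",
         "prompt": f"{issue}\nFiles: {', '.join(files)}",
         "target": functions},
        {"level": "line",
         "prompt": f"{issue}\nFunctions: {', '.join(functions)}",
         "target": lines},
    ]


def build_edit_samples(issue, spans):
    """Build one code-edit sample per changed span: issue plus context -> edited code."""
    return [
        {"prompt": f"{issue}\nCode:\n{s.before}", "target": s.after}
        for s in spans
    ]
```

In this sketch each localization level takes the answer of the coarser level as part of its prompt, which is one plausible reading of "multi-level progressive"; the paper's actual construction may differ.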
