
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute

Yingwei Ma, Yongbin Li, Yihong Dong, Xue Jiang, Rongyu Cao, Jue Chen, Fei Huang, Binhua Li

TL;DR

The paper addresses the challenge of deploying open-source LLMs for software engineering tasks in private environments by proposing a unified Test-Time Compute (TTC) framework that scales inference-time computation instead of model size. Internal TTC combines development-contextualized trajectory synthesis with rejection sampling; External TTC applies a development-process-based search guided by Process and Outcome Reward Models and execution verification. On SWE-bench Verified, a 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. The results also show test-time scaling within SWE agents: models allocate more tokens to more difficult problems. The training data, models, and code are publicly released to support private deployments and future research on software engineering agents.

Abstract

Recent advancements in software engineering agents have demonstrated promising capabilities in automating program improvements. However, their reliance on closed-source or resource-intensive models introduces significant deployment challenges in private environments, prompting a critical question: How can personally deployable open-source LLMs achieve comparable code reasoning performance? To this end, we propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. Internally, we introduce a development-contextualized trajectory synthesis method that leverages real-world software repositories to bootstrap multi-stage reasoning processes, such as fault localization and patch generation. We further enhance trajectory quality through rejection sampling, rigorously evaluating trajectories along accuracy and complexity. Externally, we propose a novel development-process-based search strategy guided by reward models and execution verification. This approach enables targeted computational allocation at critical development decision points, overcoming limitations of existing "end-point only" verification methods. Evaluations on SWE-bench Verified demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1. Additionally, we provide empirical validation of the test-time scaling phenomenon within SWE agents, revealing that models dynamically allocate more tokens to increasingly challenging problems, effectively enhancing reasoning capabilities. We publicly release all training data, models, and code to facilitate future research: https://github.com/yingweima2022/SWE-Reasoner
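
To make the contrast with "end-point only" verification concrete, the following is a minimal sketch of a search that spends compute at each development stage rather than only on finished patches. It is an illustration under stated assumptions, not the authors' implementation: the beam-style pruning, the function name `process_based_search`, the callables `generate`, `prm_score`, `passes_tests`, and `orm_score`, and the parameters `n_samples` and `beam_width` are all hypothetical stand-ins for the model, the Process and Outcome Reward Models, and the execution check described in the abstract.

```python
from typing import Callable, List, Sequence

Trajectory = List[str]  # one output per completed development stage


def process_based_search(
    stages: Sequence[str],
    generate: Callable[[str, Trajectory, int], str],  # hypothetical sampler (the LLM)
    prm_score: Callable[[Trajectory], float],         # hypothetical process reward model
    passes_tests: Callable[[Trajectory], bool],       # hypothetical execution verification
    orm_score: Callable[[Trajectory], float],         # hypothetical outcome reward model
    n_samples: int = 4,
    beam_width: int = 2,
) -> Trajectory:
    """Sample candidates at every stage, prune with the PRM, then verify and rank."""
    beams: List[Trajectory] = [[]]
    for stage in stages:
        # Expand every surviving prefix with several candidates for this stage.
        expanded = [
            prefix + [generate(stage, prefix, k)]
            for prefix in beams
            for k in range(n_samples)
        ]
        # Keep only the partial trajectories the process reward model prefers.
        expanded.sort(key=prm_score, reverse=True)
        beams = expanded[:beam_width]

    # Execution verification filters finished trajectories; fall back to all
    # finished trajectories if none passes.
    verified = [t for t in beams if passes_tests(t)]
    pool = verified or beams
    # The outcome reward model picks the final answer.
    return max(pool, key=orm_score)


# Toy stand-ins so the sketch runs end to end (purely illustrative).
def toy_generate(stage: str, prefix: Trajectory, k: int) -> str:
    return f"{stage}-{k}"


def toy_prm(traj: Trajectory) -> float:
    return float(sum(int(step.rsplit("-", 1)[1]) for step in traj))


def toy_tests(traj: Trajectory) -> bool:
    return traj[-1].endswith("-1")


def toy_orm(traj: Trajectory) -> float:
    return float(len(traj))


best = process_based_search(
    ["localize", "patch"], toy_generate, toy_prm, toy_tests, toy_orm,
    n_samples=3, beam_width=2,
)
print(best)
```

In the toy run, the trajectory with the highest process score fails the execution check, so the search returns the verified runner-up instead. This shows why intermediate scoring and final verification are kept as separate steps in the sketch.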

Paper Structure

This paper contains 20 sections, 4 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between the performance of smaller LLMs with extended Test-Time Compute and larger models on SWE-Bench Verified.
  • Figure 2: Comparison of issue resolution rates between our unified TTC framework (32B) and other LLMs across different repositories in SWE-bench Verified.
  • Figure 3: A unified view of Test-Time Scaling strategies for SWE agents. Internal TTC enhances reasoning depth through extended chain-of-thought training, while External TTC employs reward-guided search and verification to select optimal solutions. The hybrid approach combines both paradigms through iterative refinement.
  • Figure 4: Overview of Development-Process-Based Search Strategy.
  • Figure 5: Venn diagram of issue instances solved by our unified TTC framework and other models on SWE-bench Verified.
  • ...and 4 more figures