Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning

Alexander Golubev, Maria Trofimova, Sergei Polezhaev, Ibragim Badertdinov, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Sergey Abramov, Andrei Andriushchenko, Filipp Fisin, Sergei Skvortsov, Boris Yangel

TL;DR

This work addresses training long-context, multi-turn software engineering agents with a two-phase pipeline that combines rejection fine-tuning (RFT) with DAPO-based multi-turn reinforcement learning on open-weight models. Starting from Qwen2.5-72B-Instruct, the approach raises Pass@1 on SWE-bench Verified from 11% to 39% (the RFT-only baseline reaches 20%) and achieves 35% and 31% on the May and June splits of SWE-rebench, showing that RL can scale to long-horizon, interactive coding tasks without teacher models. It provides detailed analyses of training stability, data curation, and the impact of sampling strategies, while highlighting challenges such as sparse rewards and the need for better uncertainty estimation. Overall, the results suggest that open-weight, long-context SWE agents trained with RL can be competitive with larger models such as DeepSeek-V3-0324 and Qwen3-235B-A22B, and offer a practical blueprint for training agents on multi-turn, interactive software engineering tasks.

Abstract

Research on applications of reinforcement learning (RL) to large language models has mostly been focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn Markov decision processes (MDPs), this view corresponds to a degenerate case of multi-turn interaction where the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Our methodology begins with rejection fine-tuning (RFT) using execution feedback to train a policy to follow instructions and formatting effectively, followed by a synchronous RL pipeline using DAPO for iterative improvement. Applying this pipeline to Qwen2.5-72B-Instruct, we increase its Pass@1 on the SWE-bench Verified benchmark from 11% to 39%, substantially improving upon the 20% RFT baseline. On the May and June splits of SWE-rebench, the resulting agent achieves Pass@1 of 35% and 31% respectively, competitive with even larger models such as DeepSeek-V3-0324 or Qwen3-235B-A22B, demonstrating that our methodology offers a practical approach for training capable agents for multi-turn interactive tasks using open-weight models.

Paper Structure

This paper contains 30 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Illustration of task structure differences between bandit-style problems (top, e.g., math) and POMDPs (bottom, e.g., software engineering), defined in the task formulation section of the preliminaries. In bandit settings, the agent takes a single action to produce a final solution based on an initial observation. In contrast, POMDPs require a multi-step interaction loop where the agent repeatedly takes actions and interprets new environmental feedback to guide its subsequent decisions.
  • Figure 2: An example trajectory from the agent's interaction used in RFT. Only green (error-free) assistant turns contribute to training loss.
  • Figure 3: A detailed performance trend of the RL-trained agent over all iterations. Statistics include Pass@1, Pass@10, the number of submit commands and the average number of steps per trajectory. All metrics are computed on Verified-50.
  • Figure 4: One synchronous iteration of the RL pipeline (green: GPU heavy; yellow: CPU heavy).
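
The masking rule in the Figure 2 caption (only error-free assistant turns contribute to the training loss) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Turn` structure, the `caused_error` flag, and the `build_loss_mask` helper are hypothetical names introduced here, and how an erroneous turn is detected is assumed to be decided elsewhere.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    """One message in a multi-turn trajectory (hypothetical structure)."""
    role: str                   # "system", "user" (environment observation), or "assistant"
    tokens: List[int]           # token ids for this turn
    caused_error: bool = False  # assumption: set upstream when the action produced an error


def build_loss_mask(trajectory: List[Turn]) -> List[int]:
    """Return a per-token mask: 1 where the token counts toward the loss, else 0.

    Only tokens from assistant turns that did not cause an error are kept,
    mirroring the rule stated in the Figure 2 caption.
    """
    mask: List[int] = []
    for turn in trajectory:
        keep = turn.role == "assistant" and not turn.caused_error
        mask.extend([1 if keep else 0] * len(turn.tokens))
    return mask


# Toy trajectory: the first assistant action fails, the second succeeds.
trajectory = [
    Turn("system", [1, 2, 3]),
    Turn("user", [4, 5]),
    Turn("assistant", [6, 7, 8], caused_error=True),
    Turn("user", [9]),
    Turn("assistant", [10, 11]),
]

mask = build_loss_mask(trajectory)
print(mask)  # [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
```

In a training loop, a mask of this kind would typically be multiplied into the per-token loss so that system prompts, environment observations, and erroneous assistant turns remain in the context but contribute no gradient.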