Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization

Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing

TL;DR

The paper tackles the challenge of building coding agents that operate over multi-turn, tool-using workflows, where traditional preference optimization methods risk diversity collapse and underutilize test-time compute. It introduces EntroPO, an entropy-regularized, multi-turn preference optimization framework that augments DPO/KTO with a diversity-promoting term and derives EntroPO-DPO and EntroPO-KTO losses. The authors provide theoretical analysis showing the entropy term boosts exploration for high-utility trajectories and identify a closed-form policy update, while a hybrid best-trajectory selector amplifies test-time gains. Empirically, EntroPO achieves state-of-the-art results among open-weight models on SWEBench benchmarks, with notable improvements for smaller models and robust performance under test-time scaling. The work highlights the importance of preserving diversity in offline preference learning to unlock the full potential of parallel rollouts for complex software engineering tasks.

Abstract

Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains are heavily dependent on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce EntroPO, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. EntroPO augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate EntroPO by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters). To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model-free approaches. On the SWEBench leaderboard, our approach establishes new state-of-the-art results among open-weight models. A 30B parameter model trained with EntroPO ranks 1st on SWEBench-Lite and 4th on SWEBench-Verified on the open-weight leaderboard, surpassed only by models with over 10x more parameters (e.g., >350B).
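The dependence of TTS gains on output diversity can be illustrated with a toy calculation. The sketch below is not from the paper: it assumes a single task, a per-rollout success probability `p`, and an oracle selector that always picks a correct trajectory when one exists. It compares two idealized extremes, fully independent rollouts versus a collapsed policy whose rollouts are all identical. The function names are hypothetical.

```python
def pass_at_n_independent(p: float, n: int) -> float:
    """Chance that at least one of n independent rollouts succeeds."""
    return 1.0 - (1.0 - p) ** n


def pass_at_n_collapsed(p: float, n: int) -> float:
    """Chance of success when all n rollouts are identical copies.

    Extra rollouts add no new information, so the result does not depend on n.
    """
    return p


if __name__ == "__main__":
    p = 0.3  # assumed per-rollout success rate, for illustration only
    for n in (1, 2, 4, 8, 16):
        diverse = pass_at_n_independent(p, n)
        collapsed = pass_at_n_collapsed(p, n)
        print(f"N={n:2d}  independent={diverse:.3f}  collapsed={collapsed:.3f}")
```

Under these assumptions, with `p = 0.3` and `N = 8` the independent case reaches roughly 0.94 while the collapsed case stays at 0.3. Real policies sit between the two extremes, which is why preserving entropy during preference optimization matters for parallel rollouts.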

Paper Structure

This paper contains 27 sections, 4 theorems, 23 equations, 5 figures, 5 tables.

Key Result

Lemma 3.1

In a deterministic MDP, the total accumulated reward can be expressed in terms of the optimal policy $\pi^*$, the reference policy $\pi_{\text{ref}}$, and the initial value function $V^*_{1}(s_1)$.

Figures (5)

  • Figure 1: Overview of EntroPO with TTS. Given an issue and a repository, an LLM agent interacts with a sandboxed environment over multiple turns, receiving execution feedback. We run parallel rollouts to produce a pool of candidate trajectories. A hybrid selector ranks trajectories using a model-based verifier and model-free approaches, and selects the best trajectory to submit.
  • Figure 2: The Impact of Entropy Regularization on Test-Time Scaling. Performance of EntroPO-DPO, M-DPO, and SFT on SWEBench-Verified (left) and SWEBench-Lite (right) as the number of parallel rollouts ($N$) increases. EntroPO's entropy regularization consistently yields better scaling.
  • Figure 3: Ablation Studies on SWEBench-Verified. (Left) Performance contribution of each component in our hybrid selector at $N=16$. (Right) Sensitivity analysis of the hyperparameter $\zeta / \gamma$ for EntroPO-KTO. For visualization convenience, we plot the inverse of $\gamma / \zeta$ from the two-turn theorem.
  • Figure 4: Impact of Temperature on Performance. Performance of multi-turn KTO and KTO+TTS on SWEBench-Verified with varying temperature, compared to EntroPO with a fixed temperature of 0.7. Increasing temperature fails to match the performance of EntroPO and degrades performance past 0.9.
  • Figure 5: The Impact of Entropy Regularization on Test-Time Scaling. Performance of EntroPO-KTO, M-KTO, and SFT on SWEBench-Verified (left) and SWEBench-Lite (right) as the number of parallel rollouts ($N$) increases. EntroPO's entropy regularization consistently yields better scaling.
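The selection step described in Figure 1 can be sketched as a two-stage filter-then-rank procedure. This is a minimal illustration under stated assumptions, not the authors' selector: the `Trajectory` fields, the specific model-free heuristics (keep only finished rollouts with a non-empty patch), and the fallback rule are all hypothetical, and the verifier score is assumed to be precomputed by some learned model.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Trajectory:
    """One candidate rollout (all fields are hypothetical, for illustration)."""
    patch: str             # final diff produced by the agent; empty if none
    finished: bool         # whether the agent ended the episode on its own
    verifier_score: float  # score from a learned verifier, higher is better


def select_best(candidates: List[Trajectory]) -> Optional[Trajectory]:
    """Pick one trajectory: model-free filters first, then verifier ranking."""
    if not candidates:
        return None
    # Model-free step (assumed heuristics): keep finished runs with a non-empty patch.
    viable = [t for t in candidates if t.finished and t.patch.strip()]
    # Fall back to the full pool if the filters reject every candidate.
    pool = viable or candidates
    # Model-based step: rank the remaining pool by verifier score.
    return max(pool, key=lambda t: t.verifier_score)
```

Applying cheap model-free checks before the verifier keeps an overconfident verifier from choosing a rollout that never produced a usable patch. The paper's actual combination of the two signals may differ.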

Theorems & Definitions (4)

  • Lemma 3.1: Reward Sum Decomposition
  • Theorem 3.2
  • Lemma 3.3
  • Proposition 3.4: Mitigation of Diversity Collapse via Inverse Probability Weighting