Building Coding Agents via Entropy-Enhanced Multi-Turn Preference Optimization

Jiahao Yu; Zelei Cheng; Xian Wu; Xinyu Xing

TL;DR

The paper tackles the challenge of building coding agents that operate over multi-turn, tool-using workflows, where traditional preference optimization methods risk diversity collapse and underutilize test-time compute. It introduces EntroPO, an entropy-regularized, multi-turn preference optimization framework that augments DPO and KTO with a diversity-promoting term, yielding the EntroPO-DPO and EntroPO-KTO losses. The authors provide theoretical analysis showing that the entropy term boosts exploration for high-utility trajectories and identify a closed-form policy update, while a hybrid best-trajectory selector amplifies test-time gains. Empirically, EntroPO achieves state-of-the-art results among open-weight models on the SWE-bench benchmarks, with notable improvements for smaller models and robust performance under test-time scaling. The work highlights the importance of preserving diversity in offline preference learning to unlock the full potential of parallel rollouts for complex software engineering tasks.

Abstract

Software engineering presents complex, multi-step challenges for Large Language Models (LLMs), requiring reasoning over large codebases and coordinated tool use. The difficulty of these tasks is exemplified by benchmarks like SWE-bench, where current LLMs still struggle to resolve real-world issues. A promising approach to enhance performance is test-time scaling (TTS), but its gains depend heavily on the diversity of model outputs. While standard alignment methods such as Direct Preference Optimization (DPO) and Kahneman-Tversky Optimization (KTO) are effective at aligning model outputs with human preferences, this process can come at the cost of reduced diversity, limiting the effectiveness of TTS. Additionally, existing preference optimization algorithms are typically designed for single-turn tasks and do not fully address the complexities of multi-turn reasoning and tool integration required for interactive coding agents. To bridge this gap, we introduce EntroPO, an entropy-enhanced framework that adapts existing preference optimization algorithms to the multi-turn, tool-assisted setting. EntroPO augments the preference objective to explicitly preserve policy entropy and generalizes learning to optimize over multi-turn interactions rather than single-turn responses. We validate EntroPO by fine-tuning a diverse suite of models from different families and sizes (up to 106B parameters). To maximize performance gains from TTS, we further propose a hybrid best-trajectory selection scheme combining a learned verifier model with model-free approaches. On the SWE-bench leaderboard, our approach establishes new state-of-the-art results among open-weight models. A 30B-parameter model trained with EntroPO ranks 1st on SWE-bench Lite and 4th on SWE-bench Verified on the open-weight leaderboard, surpassed only by models with over 10x more parameters (e.g., >350B).
