SEAlign: Alignment Training for Software Engineering Agent
Kechi Zhang, Huangzhao Zhang, Ge Li, Jinliang You, Jia Li, Yunfei Zhao, Zhi Jin
TL;DR
SEAlign addresses the gap between post-training alignment of code models and real-world software engineering tasks by leveraging agentic trajectories and a two-stage alignment pipeline. It collects high-quality trajectories, constructs trajectory trees, scores decision nodes to identify critical actions, and then applies supervised fine-tuning (SFT) followed by fine-grained direct preference optimization (DPO) on critical action pairs to align models with real-world workflows. Across SWE-Bench-Lite, SWE-Bench-Verified, and HumanEvalFix, SEAlign delivers state-of-the-art results with modest training data. In human evaluations of small applications generated automatically by an agent-based development platform built on SEAlign, it also improves both task performance and user experience. The paper further discusses generalization, practical overheads, and threats to validity, and argues that SEAlign is a meaningful step toward scalable, fully automated software engineering powered by LLMs.
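The tree-construction and node-scoring steps can be sketched as follows. This is a minimal illustration under stated assumptions, not SEAlign's actual implementation: each trajectory is assumed to be a list of action strings plus a boolean saying whether the task was resolved; a node's score is assumed to be the empirical success rate of the trajectories that pass through it (a simple Monte Carlo estimate); and a decision is treated as critical when the best and worst sibling actions differ in score by at least a hypothetical `threshold`. The names `Node`, `build_tree`, and `critical_pairs` are likewise hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One decision point in a trajectory tree (hypothetical structure)."""
    action: str
    children: dict = field(default_factory=dict)  # action string -> Node
    visits: int = 0      # trajectories passing through this node
    successes: int = 0   # of those, how many resolved the task

    @property
    def value(self) -> float:
        # Assumed score: empirical success rate of trajectories through this node.
        return self.successes / self.visits if self.visits else 0.0


def build_tree(trajectories):
    """Merge (actions, resolved) trajectories into a prefix tree of actions."""
    root = Node(action="<root>")
    for actions, resolved in trajectories:
        node = root
        node.visits += 1
        node.successes += int(resolved)
        for act in actions:
            if act not in node.children:
                node.children[act] = Node(action=act)
            node = node.children[act]
            node.visits += 1
            node.successes += int(resolved)
    return root


def critical_pairs(root, threshold=0.5):
    """Return (prefix, chosen_action, rejected_action) where sibling scores diverge."""
    pairs = []
    stack = [((), root)]
    while stack:
        prefix, node = stack.pop()
        kids = list(node.children.values())
        if len(kids) >= 2:
            best = max(kids, key=lambda n: n.value)
            worst = min(kids, key=lambda n: n.value)
            # Hypothetical criterion: the choice made here changes the success rate a lot.
            if best.value - worst.value >= threshold:
                pairs.append((prefix, best.action, worst.action))
        for kid in kids:
            stack.append((prefix + (kid.action,), kid))
    return pairs


if __name__ == "__main__":
    trajectories = [
        (["search", "edit_a", "test"], True),
        (["search", "edit_a", "submit"], True),
        (["search", "edit_b", "test"], False),
        (["search", "edit_b", "submit"], False),
    ]
    print(critical_pairs(build_tree(trajectories)))
    # prints [(('search',), 'edit_a', 'edit_b')]
```

Each returned triple holds a shared action prefix plus a chosen and a rejected action at the same decision point, which is the shape of data a pairwise preference objective such as DPO consumes.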
Abstract
Recent advances in code generation models have demonstrated impressive capabilities in automating software development tasks, yet these models still struggle in real-world software engineering scenarios. Although current training methods, particularly post-training, produce models that excel at competitive programming problems, they fail to adequately prepare models for the complexities of practical software development. This misalignment raises a critical question: are existing alignment training methods well suited to real-world software engineering tasks? In this study, we identify this issue and propose SEAlign, a novel alignment framework designed to bridge the gap between code generation models and real-world software development tasks. SEAlign leverages the unique characteristics of software engineering processes, including high-quality workflow steps, to enhance model capabilities. Our framework further employs Monte Carlo Tree Search for fine-grained alignment in multi-step decision processes, followed by preference optimization on critical actions to ensure models meet real-world requirements. We evaluate SEAlign on three standard agentic benchmarks for real-world software engineering: HumanEvalFix, SWE-Bench-Lite, and SWE-Bench-Verified. Experimental results demonstrate state-of-the-art performance with minimal training overhead. In addition, we develop an agent-based software development platform using SEAlign, which successfully automates the creation of several small applications. Human evaluations of these applications highlight significant improvements in both task performance and user experience. Our findings underscore the potential of SEAlign to accelerate the adoption of large code models in real-world software development. We believe that this research is a meaningful step toward fully automated software engineering.
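Preference optimization on a single critical action pair can be illustrated with the standard DPO loss. The sketch below assumes that each log-probability is the summed log-likelihood of one action's tokens, conditioned on the shared trajectory prefix up to the critical decision node, and that a frozen reference model (for example, the SFT checkpoint) supplies the reference terms. The function name `dpo_loss` and the default `beta=0.1` are illustrative assumptions, not values taken from SEAlign.

```python
import math


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) action pair.

    Each *_logp is assumed to be the summed log-probability of the action's
    tokens given the shared prefix that ends at the critical decision node.
    """
    # Implicit reward of each action: log-ratio of the policy to the frozen reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)) equals softplus(-margin); use the numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy matches the reference on both actions, the margin is zero and the loss is log 2 (about 0.693). The loss falls below that as the policy raises the chosen action's likelihood relative to the rejected one, and rises above it in the opposite case. Restricting the pair to the action taken at a critical node, rather than to whole trajectories, is what makes the preference signal fine-grained in this sketch.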
