Teaching Your Models to Understand Code via Focal Preference Alignment
Jie Wu, Haoling Li, Xin Zhang, Xiao Liu, Yangyu Huang, Jianwen Luo, Yizhen Zhang, Zuchao Li, Ruihang Chu, Yujiu Yang, Scarlett Li
TL;DR
The paper tackles label noise and signal dilution in preference learning for Code LLMs by introducing Target-DPO, a framework that mirrors human debugging: it locates error regions and performs token-level focal alignment on them. It builds CodeFlow, a dataset of iterative code refinements with their corresponding token edits, and presents a tailored DPO objective that masks irrelevant tokens and emphasizes error-correcting edits. Across five public benchmarks, Target-DPO trained on 59k preference pairs consistently outperforms standard DPO, RPO, and SFT baselines, performs strongly on challenging tasks such as BigCodeBench, and reduces common error types. By providing finer-grained learning signals, the work offers a scalable path toward more reliable Code LLMs. The authors also discuss limitations and future directions, including larger-scale validation.
Abstract
Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. Existing approaches evaluate a set of n candidate solutions by test-case pass rate, labeling the candidate with the higher pass rate as positive and its counterpart with the lower pass rate as negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity needed to capture meaningful error-correction relationships, so the model cannot learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To support this, we introduce the CodeFlow dataset, in which samples are iteratively refined until they pass the tests, so that the modifications capture error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, models, and datasets are available at: https://github.com/JieWu02/Target-DPO.
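To make the contrast between block-level and token-level alignment concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes whitespace tokenization and uses Python's standard difflib to find the edited tokens; the function names pick_preference_pair and focal_masks are hypothetical. The first function reflects the pass-rate labeling described above, and the second marks only the tokens that differ between a failing sample and its refined version, which is the kind of region a focal objective would emphasize.

```python
import difflib


def pick_preference_pair(candidates, pass_rates):
    """Pass-rate labeling: highest pass rate is positive, lowest is negative.

    Hypothetical helper; assumes one pass rate per candidate.
    """
    if len(candidates) != len(pass_rates) or len(candidates) < 2:
        raise ValueError("need at least two candidates, one pass rate each")
    ranked = sorted(zip(pass_rates, candidates), key=lambda pair: pair[0])
    (low_rate, rejected), (high_rate, chosen) = ranked[0], ranked[-1]
    if low_rate == high_rate:
        raise ValueError("all candidates have the same pass rate")
    return chosen, rejected


def focal_masks(rejected_tokens, chosen_tokens):
    """Mark tokens that differ between the failing and the refined sample.

    Returns two 0/1 lists; 1 means the token lies in an edited region.
    The diff-based localization is an assumption made for illustration.
    """
    matcher = difflib.SequenceMatcher(
        a=rejected_tokens, b=chosen_tokens, autojunk=False
    )
    rejected_mask = [0] * len(rejected_tokens)
    chosen_mask = [0] * len(chosen_tokens)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # shared tokens carry no error-correction signal
        for i in range(i1, i2):
            rejected_mask[i] = 1
        for j in range(j1, j2):
            chosen_mask[j] = 1
    return rejected_mask, chosen_mask


# Toy example: the failing sample subtracts where it should add.
rejected = "def add ( a , b ) : return a - b".split()
chosen = "def add ( a , b ) : return a + b".split()
rejected_mask, chosen_mask = focal_masks(rejected, chosen)
```

In this toy pair only the operator token is marked in each sequence, whereas block-level alignment would treat all twelve tokens of the failing sample as equally responsible for the failure.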
