Teaching Your Models to Understand Code via Focal Preference Alignment

Jie Wu, Haoling Li, Xin Zhang, Xiao Liu, Yangyu Huang, Jianwen Luo, Yizhen Zhang, Zuchao Li, Ruihang Chu, Yujiu Yang, Scarlett Li

TL;DR

The paper tackles label noise and signal dilution in preference learning for Code LLMs by introducing Target-DPO, a framework that mirrors human debugging to locate error regions and perform token-level focal alignment. It builds CodeFlow, a dataset of iterative code refinements and corresponding token edits, and presents a tailored DPO objective that masks irrelevant tokens and emphasizes error-correcting edits. Across five public benchmarks, Target-DPO with 59k preference pairs demonstrates consistent gains over standard DPO, RPO, and SFT baselines, achieving strong performance on challenging tasks like BigCodeBench and reducing common error types. The work advances practical code generation by enabling finer-grained learning signals, offering a scalable pathway for more reliable Code LLMs, with limitations and future directions discussed for larger-scale validation.

Abstract

Preference learning extends the performance of Code LLMs beyond traditional supervised fine-tuning by leveraging relative quality comparisons. In existing approaches, a set of n candidate solutions is evaluated by test-case pass rate: the candidate with the higher pass rate is labeled positive and its counterpart with the lower pass rate is labeled negative. However, because this approach aligns entire failing code blocks rather than pinpointing specific errors, it lacks the granularity needed to capture meaningful error-correction relationships. As a result, the model cannot learn more informative error-correction patterns. To address these issues, we propose Target-DPO, a new preference alignment framework that mimics human iterative debugging to refine Code LLMs. Target-DPO explicitly locates error regions and aligns the corresponding tokens via a tailored DPO algorithm. To support this, we introduce the CodeFlow dataset, in which samples are iteratively refined until they pass the tests, so that the modifications capture error corrections. Extensive experiments show that a diverse suite of Code LLMs equipped with Target-DPO achieves significant performance gains in code generation and improves on challenging tasks like BigCodeBench. In-depth analysis reveals that Target-DPO yields fewer errors. Code, model, and datasets are available at: https://github.com/JieWu02/Target-DPO.
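To make the idea of "locating error regions" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the error region can be approximated by diffing a failing version against its refined, passing version at the token level. The helper name `changed_token_mask` and the whitespace tokenization are hypothetical and chosen only for illustration; a real pipeline would use the model's tokenizer.

```python
import difflib

def changed_token_mask(rejected_tokens, chosen_tokens):
    """Return a 0/1 mask over rejected_tokens.

    A token is marked 1 when the diff against chosen_tokens replaces or
    deletes it; tokens shared by both versions stay 0.
    (Hypothetical helper: an illustration of error localization by diff.)
    """
    mask = [0] * len(rejected_tokens)
    matcher = difflib.SequenceMatcher(
        a=rejected_tokens, b=chosen_tokens, autojunk=False
    )
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag in ("replace", "delete"):
            for i in range(i1, i2):
                mask[i] = 1
    return mask

# Toy example: the failing version subtracts instead of adding.
rejected = "def add ( a , b ) : return a - b".split()
chosen = "def add ( a , b ) : return a + b".split()
mask = changed_token_mask(rejected, chosen)
error_tokens = [tok for tok, m in zip(rejected, mask) if m]
```

In this toy pair only the `-` token is flagged, while the other eleven tokens are left unmarked. Pure insertions in the refined version have no counterpart in the failing version, so this sketch does not mark them; handling them would need an additional convention.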

Paper Structure

This paper contains 35 sections, 10 equations, 11 figures, 21 tables, 1 algorithm.

Figures (11)

  • Figure 1: Target-DPO achieves significant performance gains over DPO variants on challenging coding tasks, i.e., BigCodeBench-Hard, with Qwen2.5-Coder-7B.
  • Figure 2: In LLM-generated code, errors are usually confined to critical parts. Minor adjustments to the corresponding erroneous tokens can correct the code while leaving the majority unchanged. Therefore, an effective error correction requires first identifying the key error lines and then performing focal alignment.
  • Figure 3: Method Overview. Target-DPO constructs preference pairs via iterative debugging, treating the correct version as preferred and the previous as dispreferred. DPO adaptations enable code LLMs to learn the correct pattern from the preferred code while highlighting critical tokens with a masking strategy in the dispreferred sample.
  • Figure 4: Comparison with CodeDPO, PLUM, and Code-Optimise using DeepSeekCoder-6.7B. Additional results are provided in the appendix (comparison with other methods).
  • Figure 5: Illustration for Target-DPO and its ablations. Target-DPO rewards correct code tokens while penalizing only error-specific tokens in rejected code, teaching models to truly understand code through targeted alignment.
  • ...and 6 more figures
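
The caption of Figure 5 describes rewarding correct code tokens while penalizing only error-specific tokens in the rejected code. The sketch below shows one way such a masked objective could look. It is an assumption, not the paper's stated loss: it takes the standard DPO form, a negative log-sigmoid of a scaled log-ratio margin, and applies a 0/1 mask to the rejected-side terms. The function name `masked_dpo_loss`, its arguments, and the default `beta` are all hypothetical.

```python
import math

def masked_dpo_loss(pol_chosen, ref_chosen, pol_rejected, ref_rejected,
                    rejected_mask, beta=0.1):
    """DPO-style loss for one pair with a token mask on the rejected side.

    Each pol_* / ref_* argument is a list of per-token log-probabilities
    under the policy and the frozen reference model. rejected_mask is a
    0/1 list: only tokens marked 1 contribute to the rejected term.
    (Assumed form for illustration; not taken from the paper.)
    """
    # All chosen tokens contribute their policy-vs-reference log-ratio.
    chosen_term = sum(p - r for p, r in zip(pol_chosen, ref_chosen))
    # Only masked (error) tokens of the rejected sample contribute.
    rejected_term = sum(
        m * (p - r)
        for p, r, m in zip(pol_rejected, ref_rejected, rejected_mask)
    )
    margin = beta * (chosen_term - rejected_term)
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Toy pair: two tokens per side, only the second rejected token is an error.
base = masked_dpo_loss([-1.0, -2.0], [-1.0, -2.0],
                       [-1.0, -3.0], [-1.0, -3.0], [0, 1])
# Changing an unmasked rejected token leaves the loss unchanged.
unmasked_shift = masked_dpo_loss([-1.0, -2.0], [-1.0, -2.0],
                                 [-5.0, -3.0], [-1.0, -3.0], [0, 1])
# Lowering the policy probability of the masked error token reduces the loss.
masked_shift = masked_dpo_loss([-1.0, -2.0], [-1.0, -2.0],
                               [-1.0, -4.0], [-1.0, -3.0], [0, 1])
```

With the mask set to all ones, this reduces to the usual sequence-level DPO margin, which is the "entire failing code block" alignment that the abstract contrasts against.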