CodeDPO: Aligning Code Models with Self Generated and Verified Source Code

Kechi Zhang, Ge Li, Yihong Dong, Jingjing Xu, Jun Zhang, Jing Su, Yongfei Liu, Zhi Jin

TL;DR

CodeDPO addresses the limitations of supervised fine-tuning for code generation by integrating preference learning through a self-generated, self-validated dataset and a PageRank-inspired ranking of code and tests. It employs Direct Preference Optimization (DPO) in conjunction with a robustness-enhancing loss to optimize both code correctness and execution efficiency without relying on external test resources. Empirical results across five benchmarks show substantial gains in correctness and notable runtime speedups, with an 83.5% HumanEval pass rate achieved on a 6.7B backbone. The framework offers a scalable foundation for offline code preference optimization and points to future work on readability and security considerations.
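
For reference, the standard DPO objective mentioned above can be written as below. The notation is the usual one from the DPO literature and is an assumption here, not taken from this page: $x$ is the prompt, $y_w$ and $y_l$ are the preferred and rejected code snippets, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\beta$ is a scaling hyperparameter, and $\sigma$ is the logistic function. The robustness-enhancing loss term is not spelled out in this summary, so it is left out of the sketch.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```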

Abstract

Code generation models have shown significant potential for programming tasks. However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases. The underlying assumption is that test cases executable by multiple code snippets provide more reliable validation, and code that passes more tests is more likely to be correct. Through this self-validation process, our PageRank-inspired algorithm iteratively updates the ranking score of each code snippet, ultimately creating a code preference optimization dataset based on correctness and efficiency. CodeDPO is flexible and scalable, generating diverse preference optimization data without depending on external resources. Through comprehensive evaluations of five widely used benchmarks, CodeDPO demonstrates significant improvements in correctness and efficiency compared to existing methods. Our experiments prove that CodeDPO enhances the capabilities of LLMs in code generation and provides a robust foundation for conducting code preference optimization in more complex and challenging real-world scenarios.
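
The mutual-reinforcement idea in the abstract (tests passed by many snippets are more credible, and snippets passing credible tests are more likely correct) can be illustrated with a minimal Python sketch. The update rule below is an assumed PageRank-style form, not necessarily the exact one in the paper. The names `self_validation_scores`, `passes`, `num_iters`, and `damping` are hypothetical. The defaults of 2 and 0.5 mirror the $T$ and $d$ values quoted in the Figure 2 caption, on the assumption that $T$ is the iteration count and $d$ the damping factor.

```python
def self_validation_scores(passes, num_iters=2, damping=0.5):
    """Assumed PageRank-style scoring over a code/test pass matrix.

    passes[i][j] is True when code snippet i passes test case j.
    Returns (code_scores, test_scores) after num_iters synchronous updates.
    """
    n_code = len(passes)
    n_test = len(passes[0]) if passes else 0

    # Assumption: every snippet and every test starts with score 1.
    code_scores = [1.0] * n_code
    test_scores = [1.0] * n_test

    for _ in range(num_iters):
        # A snippet gains score from the tests it passes.
        new_code = [
            (1 - damping) * code_scores[i]
            + damping * sum(test_scores[j] for j in range(n_test) if passes[i][j])
            for i in range(n_code)
        ]
        # A test gains score from the snippets that pass it.
        new_test = [
            (1 - damping) * test_scores[j]
            + damping * sum(code_scores[i] for i in range(n_code) if passes[i][j])
            for j in range(n_test)
        ]
        code_scores, test_scores = new_code, new_test

    return code_scores, test_scores


# Toy example: 3 snippets, 3 tests (illustrative data, not from the paper).
passes = [
    [True, True, False],    # snippet 0 passes tests 0 and 1
    [True, False, False],   # snippet 1 passes test 0 only
    [False, False, True],   # snippet 2 passes test 2 only
]
code_scores, test_scores = self_validation_scores(passes)
print(code_scores)  # [2.0, 1.25, 1.0]
print(test_scores)  # [2.0, 1.25, 1.0]
```

In this toy run, snippet 0 ranks highest because it passes the two tests that other snippets also corroborate, while snippet 2 passes only a test that no other snippet agrees with. A preference pair for correctness could then take a high-scoring snippet as chosen and a low-scoring one as rejected.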

Paper Structure

This paper contains 42 sections, 2 equations, 4 figures, and 17 tables.

Figures (4)

  • Figure 1: Log probabilities for code with varying correctness and efficiency during Phi-2-2.7B model training on our constructed dataset. The traditional SFT strategy struggles to teach models to prefer correct solutions over incorrect or slow ones. In contrast, our CodeDPO approach effectively optimizes for both correctness and efficiency.
  • Figure 2: CodeDPO involves four steps: ❶ Data Seed Construction with real-world source code; ❷ Correctness Optimization with the self-validation score; ❸ Efficiency Optimization with execution time on credible tests; ❹ Model Optimization Training. In this figure, $T$ is set to 2 and $d$ to 0.5, and the final score is rounded to one decimal place for simplicity. Details are shown in the appendix.
  • Figure 3: Runtime Speedup and Percentage of Optimized Code on HumanEval+ and MBPP+.
  • Figure 4: Python Implementation of the Self-Validation Scores in Figure 2.