CodeDPO: Aligning Code Models with Self Generated and Verified Source Code

Kechi Zhang; Ge Li; Yihong Dong; Jingjing Xu; Jun Zhang; Jing Su; Yongfei Liu; Zhi Jin

TL;DR

CodeDPO addresses the limitations of supervised fine-tuning for code generation by integrating preference learning through a self-generated, self-validated dataset and a PageRank-inspired ranking of code and tests. It employs Direct Preference Optimization (DPO) in conjunction with a robustness-enhancing loss to optimize both code correctness and execution efficiency without relying on external test resources. Empirical results across five benchmarks show substantial gains in correctness and notable runtime speedups, with an 83.5% HumanEval pass rate achieved on a 6.7B backbone. The framework offers a scalable foundation for offline code preference optimization and points to future work on readability and security considerations.
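
As a rough illustration of the preference objective mentioned above, the standard per-pair DPO loss can be sketched as follows. This is a minimal sketch, not the authors' training code: the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration, and the additional robustness-enhancing term is not reproduced here because this summary does not specify its form.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; not taken from the paper
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy favors the chosen sample than the reference does,
    # minus the same quantity for the rejected sample.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    x = beta * margin
    # -log(sigmoid(x)) written in a numerically stable form.
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))
```

When the policy matches the reference, the margin is zero and the loss equals log 2; the loss falls as the policy raises the chosen (correct or faster) code relative to the rejected one.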

Abstract

Code generation models have shown significant potential for programming tasks. However, existing training methods like supervised fine-tuning face key limitations: they do not effectively teach models to prioritize correct over incorrect solutions in ambiguous situations, nor do they effectively optimize the runtime efficiency of the generated code. To address these challenges, we propose CodeDPO, a framework that integrates preference learning into code generation to improve two key code preference factors: code correctness and efficiency. CodeDPO employs a novel dataset construction method, utilizing a self-generation-and-validation mechanism that simultaneously generates and evaluates code and test cases. The underlying assumption is that test cases executable by multiple code snippets provide more reliable validation, and code that passes more tests is more likely to be correct. Through this self-validation process, our PageRank-inspired algorithm iteratively updates the ranking score of each code snippet, ultimately creating a code preference optimization dataset based on correctness and efficiency. CodeDPO is flexible and scalable, generating diverse preference optimization data without depending on external resources. In comprehensive evaluations on five widely used benchmarks, CodeDPO demonstrates significant improvements in correctness and efficiency compared to existing methods. Our experiments show that CodeDPO enhances the capabilities of LLMs in code generation and provides a robust foundation for conducting code preference optimization in more complex and challenging real-world scenarios.
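
The mutual-reinforcement idea in the abstract (tests passed by many snippets are more credible, and snippets passing credible tests score higher) can be sketched as a small iterative update. This is a minimal sketch under stated assumptions, not the paper's implementation: the function name `self_validation_scores`, the boolean `passes` matrix, the initial score of 1.0, and the exact damped update rule are illustrative choices. The defaults `num_iters=2` and `damping=0.5` mirror the values $T = 2$ and $d = 0.5$ quoted in the Figure 2 caption.

```python
from typing import List, Tuple


def self_validation_scores(
    passes: List[List[bool]],
    num_iters: int = 2,    # T in the Figure 2 caption
    damping: float = 0.5,  # d in the Figure 2 caption
) -> Tuple[List[float], List[float]]:
    """Score code snippets and tests from a pass matrix.

    passes[i][j] is True when code snippet i passes test case j.
    The update rule below is an assumed PageRank-style form.
    """
    n_code = len(passes)
    n_test = len(passes[0]) if passes else 0
    code_scores = [1.0] * n_code  # assumed initialization
    test_scores = [1.0] * n_test

    for _ in range(num_iters):
        # A snippet gains score from every test it passes.
        new_code = [
            (1 - damping) * code_scores[i]
            + damping * sum(test_scores[j] for j in range(n_test) if passes[i][j])
            for i in range(n_code)
        ]
        # A test gains score from every snippet that passes it.
        new_test = [
            (1 - damping) * test_scores[j]
            + damping * sum(code_scores[i] for i in range(n_code) if passes[i][j])
            for j in range(n_test)
        ]
        # Both updates use the previous round's scores.
        code_scores, test_scores = new_code, new_test

    return code_scores, test_scores


# Three hypothetical snippets and three hypothetical tests.
example = [
    [True, True, False],
    [True, False, False],
    [False, False, True],
]
codes, tests = self_validation_scores(example)
print(codes)  # [2.0, 1.25, 1.0]
print(tests)  # [2.0, 1.25, 1.0]
```

In this toy example the snippet that passes two mutually supported tests ranks first, while the snippet that only passes a test no other snippet agrees with ranks last. Preference pairs would then be drawn from high- and low-scoring snippets.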

Paper Structure (42 sections, 2 equations, 4 figures, 17 tables)

Introduction
Related Work
Large Language Models for Code
Preference Optimization for Code Models
CodeDPO: Self-Verified Performance Optimization Code Generation Framework
Data Seed Construction
Correctness Optimization with Self-Generation and Validation
Ranking Code Snippets and Test Cases Using Self-Validation Scores
Execution Efficiency Optimization
Final Dataset and Model Optimization
Experiment Setup
Backbone LLMs
Training and Inference Settings
Results and Analyses
Code Correctness (RQ1)
...and 27 more sections

Figures (4)

Figure 1: Log probabilities for code with varying correctness and efficiency during Phi-2-2.7B model training on our constructed dataset. The traditional SFT strategy struggles to teach models to prefer correct solutions over incorrect or slow ones. In contrast, our CodeDPO approach effectively optimizes for both correctness and efficiency.
Figure 2: Our CodeDPO involves four steps: ❶ Data Seed Construction with real-world source code; ❷ Correctness Optimization with self-validation score (in this figure we set $T$ to 2 and $d$ to 0.5; for simplicity, the final score in the figure is rounded to one decimal place; details are shown in the appendix); ❸ Efficiency Optimization with execution time on credible tests; ❹ Model Optimization Training.
Figure 3: Runtime Speedup and Percentage of Optimized Code on HumanEval+ and MBPP+.
Figure 4: Python Implementation of the Self-Validation Scores in Figure 2.
