Learning Code Preference via Synthetic Evolution

Jiawei Liu, Thanh Nguyen, Mingyue Shang, Hantian Ding, Xiaopeng Li, Yu Yu, Varun Kumar, Zijian Wang

TL;DR

The paper addresses code preference learning for LLM-based code generation by introducing CodeFavor, a framework that trains pairwise code preference models on synthetic evolution data drawn from code commits and code critiques. It couples two data pipelines, Commit-Instruct and Critic-Evol, with classification and generation outputs, and is evaluated on CodePrefBench, a benchmark of 1,364 tasks covering correctness, efficiency, security, and human preference. CodeFavor improves the accuracy of model-based code preferences by up to 28.8% and matches models with 6-9x more parameters while being 34x more cost-effective. Human preference tends to be more accurate for code correctness, but it is costly (23.4 person-minutes per task, with 15.1-40.3% of tasks left unsolved) and sub-optimal for non-functional objectives. The work demonstrates the viability of synthetic evolution for scalable code preference learning and examines design choices, data sources, and the trade-offs between human and model-based judgments.

Abstract

Large Language Models (LLMs) have recently demonstrated remarkable coding capabilities. However, assessing code generation based on well-formed properties and aligning it with developer preferences remains challenging. In this paper, we explore two key questions under the new challenge of code preference learning: (i) How do we train models to predict meaningful preferences for code? and (ii) How do human and LLM preferences align with verifiable code properties and developer code tastes? To this end, we propose CodeFavor, a framework for training pairwise code preference models from synthetic evolution data, including code commits and code critiques. To evaluate code preferences, we introduce CodePrefBench, a benchmark comprising 1364 rigorously curated code preference tasks to cover three verifiable properties (correctness, efficiency, and security) along with human preference. Our evaluation shows that CodeFavor holistically improves the accuracy of model-based code preferences by up to 28.8%. Meanwhile, CodeFavor models can match the performance of models with 6-9x more parameters while being 34x more cost-effective. We also rigorously validate the design choices in CodeFavor via a comprehensive set of controlled experiments. Furthermore, we discover the prohibitive costs and limitations of human-based code preference: despite spending 23.4 person-minutes on each task, 15.1-40.3% of tasks remain unsolved. Compared to model-based preference, human preference tends to be more accurate under the objective of code correctness, while being sub-optimal for non-functional objectives.
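
To make the pairwise setup concrete, the following is a minimal sketch of what a code preference task and an accuracy measurement could look like. It is not the CodeFavor or CodePrefBench implementation: the `PreferenceTask` fields, the `accuracy` helper, the toy `shorter_code_judge`, and the two example tasks are all hypothetical, and the scoring rule (fraction of exact matches) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class PreferenceTask:
    """Hypothetical pairwise task: two candidates for one instruction."""
    instruction: str
    code_a: str
    code_b: str
    criterion: str  # e.g. "correctness", "efficiency", "security"
    preferred: str  # ground-truth label, "A" or "B"


# A judge is anything that maps a task to "A" or "B" (a model, a heuristic, ...).
Judge = Callable[[PreferenceTask], str]


def accuracy(judge: Judge, tasks: Iterable[PreferenceTask]) -> float:
    """Fraction of tasks where the judge picks the preferred candidate."""
    tasks = list(tasks)
    if not tasks:
        raise ValueError("need at least one task")
    hits = sum(judge(task) == task.preferred for task in tasks)
    return hits / len(tasks)


def shorter_code_judge(task: PreferenceTask) -> str:
    """Toy baseline (an assumption, not from the paper): prefer shorter code."""
    return "A" if len(task.code_a) <= len(task.code_b) else "B"


# Two made-up tasks for illustration only.
tasks = [
    PreferenceTask(
        instruction="Sum a list of numbers.",
        code_a="def f(xs): return sum(xs)",
        code_b="def f(xs):\n    t = 0\n    for x in xs: t += x\n    return t",
        criterion="efficiency",
        preferred="A",
    ),
    PreferenceTask(
        instruction="Return the largest element.",
        code_a="def f(xs): return xs[0]",
        code_b="def f(xs): return max(xs)",
        criterion="correctness",
        preferred="B",
    ),
]

print(accuracy(shorter_code_judge, tasks))
```

The length heuristic gets the efficiency task right and the correctness task wrong, which illustrates why a surface-level judge is not enough and a learned preference model is of interest.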

Paper Structure (24 sections, 2 equations, 18 figures, 6 tables)

Figures (18)

  • Figure 1: Approach overview of CodeFavor. We train a pairwise preference model using synthetic data created from two complementary sources of code evolution: Commit-Instruct and Critic-Evol.
  • Figure 2: Developer confidence distribution.
  • Figure 4: Estimated per-sample cost and accuracy.
  • Figure 5: Exemplifying prompts in Commit-Instruct for generating preference code pairs.
  • Figure 6: A filtered commit in Commit-Instruct for not being clearly useful.
  • ...and 13 more figures