Learning to Align Human Code Preferences

Xin Yin, Chao Ni, Xiaohu Yang

TL;DR

This work analyzes when Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) are most effective for aligning large language models with human code preferences, and introduces Adaptive Preference Optimization (APO), which integrates the two dynamically during training. Through theoretical analysis and APPS-derived experiments across six code-preference tasks, it shows that SFT excels on tasks with objectively verifiable optimal solutions, while S&D (SFT followed by DPO) and APO improve performance on tasks without them; APO offers comparable or superior results with a simpler training pipeline. The study provides practical guidance on strategy selection and shows that APO can streamline code-preference alignment at scale with competitive efficiency. Overall, APO unifies training across varied code-preference goals, enabling robust alignment without manually distinguishing between scenarios.

Abstract

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training. Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and S&D strategies. Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment scenarios.
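
To make the contrast between the two objectives concrete, the following is a minimal sketch of the standard per-example SFT and DPO losses as they are commonly defined in the literature, not the paper's own implementation. The function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for illustration. APO is not sketched because its formula is not given here.

```python
import math


def sft_loss(logp_chosen: float) -> float:
    """Standard SFT objective for one example: negative log-likelihood
    of the preferred response under the policy being trained."""
    return -logp_chosen


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls deviation from the reference
) -> float:
    """Standard DPO objective for one preference pair:
    -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))),
    where each log-ratio compares the policy with a frozen reference model."""
    margin = beta * (
        (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    )
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Hypothetical sequence log-probabilities, for illustration only.
    print(sft_loss(-2.0))
    print(dpo_loss(-2.0, -3.0, -2.5, -2.5))
```

SFT only raises the likelihood of the preferred response, whereas DPO depends on the gap between preferred and dispreferred responses relative to the reference model, which is the difference the abstract's hypothesis turns on.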

Paper Structure

This paper contains 23 sections, 1 theorem, 7 equations, 4 figures, and 12 tables.

Key Result

Theorem 1

We analyze the objectives of SFT and DPO under ideal optimization conditions to understand their fundamental differences. Assume that both optimization processes converge to their respective global minima. Given a preference dataset $\mathcal{D}$, let $\Pi_\mathrm{SFT}$ and $\Pi_\mathrm{DPO}$ denote

Figures (4)

  • Figure 1: The updated confidence score of different algorithms
  • Figure 2: Learning dynamics of SFT, DPO, and APO on code correctness preference
  • Figure 3: Learning dynamics of SFT, DPO, and APO on code efficiency preference
  • Figure 4: Number of best-performance instances across code preference scenarios

Theorems & Definitions (1)

  • Theorem 1