ActiveDPO: Active Direct Preference Optimization for Sample-Efficient Alignment

Xiaoqiang Lin, Arun Verma, Zhongxiang Dai, Daniela Rus, See-Kiong Ng, Bryan Kian Hsiang Low

TL;DR

ActiveDPO tackles the challenge of aligning LLMs with human preferences efficiently by introducing a gradient-based, uncertainty-driven data selection criterion that is grounded in neural dueling bandits theory and uses the LLM itself as an implicit, non-linear reward model. The method regenerates prompt–response pairs each iteration, selects informative triplets via the criterion ||∇ r_{θ_{t-1}}(x,y1) − ∇ r_{θ_{t-1}}(x,y2)||_{V_{t-1}^{−1}}, and updates the model with Direct Preference Optimization (DPO). Key contributions include a theoretical bound on reward-difference estimation error, batch selection, LoRA gradient random projection, and gradient normalization to improve practicality, with extensive experiments showing consistent improvements over baselines on TLDR and WebGPT across multiple LLMs. This work reduces labeling cost for high-quality alignment and provides a pathway toward scalable, theory-guided active preference learning for large language models.
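The selection criterion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes the reward gradients have already been computed and flattened into vectors, and it assumes that the matrix V is updated greedily inside a batch with a rank-one term built from each selected gradient difference (a common choice in dueling-bandit algorithms, not confirmed by the summary). The names `uncertainty_score` and `select_batch` are hypothetical.

```python
import numpy as np


def uncertainty_score(g1, g2, V_inv):
    """Return ||g1 - g2||_{V^{-1}} = sqrt((g1 - g2)^T V^{-1} (g1 - g2))."""
    d = g1 - g2
    return float(np.sqrt(d @ V_inv @ d))


def select_batch(grad_pairs, V, batch_size):
    """Greedily pick the candidates with the largest uncertainty score.

    grad_pairs: list of (g1, g2) gradient vectors, one pair per candidate
                triplet (x, y1, y2).
    V:          current positive-definite matrix (for example lambda * I).
    Assumption: after each pick, V grows by the outer product of the chosen
    gradient difference, so later picks favor directions not yet covered.
    """
    V = np.array(V, dtype=float)  # work on a float copy, leave the input intact
    remaining = list(range(len(grad_pairs)))
    chosen = []
    for _ in range(min(batch_size, len(remaining))):
        V_inv = np.linalg.inv(V)
        scores = [uncertainty_score(*grad_pairs[i], V_inv) for i in remaining]
        best = remaining[int(np.argmax(scores))]
        chosen.append(best)
        remaining.remove(best)
        d = grad_pairs[best][0] - grad_pairs[best][1]
        V += np.outer(d, d)  # rank-one update (assumed, see docstring)
    return chosen, V
```

With V set to the identity and three candidates whose gradient differences are (3, 0), (0, 1) and (2, 0), the sketch picks the first candidate and then the second: once the first direction has been covered, the third candidate's score drops below that of the orthogonal one.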

Abstract

The recent success of using human preferences to align large language models (LLMs) has significantly improved their performance in various downstream tasks like question answering, mathematical reasoning, and code generation. However, achieving effective LLM alignment depends on high-quality human preference datasets. Collecting these datasets requires human preference annotation, which is costly and resource-intensive, necessitating efficient active data selection methods. Existing methods either lack a strong theoretical foundation or depend on restrictive reward function assumptions (e.g., linearity). To this end, we propose an algorithm, ActiveDPO, that uses a theoretically grounded data selection criterion for non-linear reward functions while directly leveraging the LLM itself to parameterize the reward model that is used for active data selection. As a result, ActiveDPO explicitly accounts for the influence of the LLM on data selection, unlike methods that select the data without considering the LLM that is being aligned, thereby leading to more effective and efficient data collection. Extensive experiments show that ActiveDPO outperforms existing methods across various models and datasets.

Paper Structure

This paper contains 20 sections, 4 theorems, 13 equations, 5 figures, 1 algorithm.

Key Result

Proposition 1

Let $r_{\theta}$ denote a fully connected neural network with a width of $m$ in each layer and depth of $L$. Let $\delta \in (0,1)$. Assume that there is a ground-truth reward function $r$ and that human preferences are sampled from the BTL preference model. As long as $m \ge M$, then with a probability for all $x \in \mathcal{X}$ and $y_1,y_2 \in \mathcal{Y}, t \in [T]$ when using the DPO objective d
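For readers unfamiliar with the two ingredients named in the proposition, the block below writes out the Bradley-Terry-Luce (BTL) preference model and the implicit reward used in the standard DPO formulation. The symbols $\sigma$, $\beta$, $\pi_{\theta}$ and $\pi_{\mathrm{ref}}$ do not appear in the text above; they are the usual DPO notation and are assumed here, and the paper's exact parameterization may differ.

```latex
% BTL model: probability that y_1 is preferred over y_2 for prompt x,
% where r is the ground-truth reward and \sigma is the logistic function.
\[
  \Pr(y_1 \succ y_2 \mid x) = \sigma\bigl(r(x, y_1) - r(x, y_2)\bigr),
  \qquad
  \sigma(z) = \frac{1}{1 + e^{-z}} .
\]

% Implicit reward in standard DPO (assumed notation): the policy \pi_\theta
% itself defines the reward, up to a term that depends only on x and
% therefore cancels in the difference r_\theta(x, y_1) - r_\theta(x, y_2).
\[
  r_{\theta}(x, y) = \beta \log \frac{\pi_{\theta}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} .
\]
```

Only the reward difference enters the BTL probability, which is why the proposition is stated in terms of the estimation error of that difference.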

Figures (5)

  • Figure 1: Comparison of average rewards for responses generated by the LLM using different selection strategies.
  • Figure 2: Different models require different data to achieve good alignment performance. We train the Gemma model using two different SFT datasets to obtain Model 1 and Model 2. We construct 3 different human preference datasets and perform DPO training on these 3 datasets for these two models, respectively.
  • Figure 3: Effect of normalizing LoRA gradients on the performance of ActiveDPO.
  • Figure 4: Effect of Random Projection Dimensionality of LoRA gradients.
  • Figure 5: Comparison of the win-rate of the responses generated by the LLM trained by DPO with the responses generated by the initial LLM with different selection strategies.
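Figures 3 and 4 concern two practical steps applied to LoRA gradients: normalization and random projection to a lower dimension. The sketch below shows one plausible way to combine them. It assumes a dense Gaussian projection matrix scaled by 1/sqrt(k) and assumes that normalization happens after projection; neither detail is stated in this summary, and the function names are hypothetical.

```python
import numpy as np


def make_projection(d, k, seed=0):
    """Build a fixed k x d Gaussian random projection matrix.

    Scaling by 1/sqrt(k) roughly preserves Euclidean norms in expectation
    (the usual Johnson-Lindenstrauss construction, assumed here).
    """
    rng = np.random.default_rng(seed)
    return rng.standard_normal((k, d)) / np.sqrt(k)


def project_and_normalize(grad, P, eps=1e-12):
    """Project a flattened gradient to k dimensions, then scale to unit norm."""
    z = P @ grad
    return z / (np.linalg.norm(z) + eps)
```

The same matrix `P` has to be reused for every gradient so that projected vectors stay comparable. The projected dimension `k` is the quantity varied in Figure 4.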

Theorems & Definitions (8)

  • Proposition 1: Estimation error of the reward difference (informal version of Proposition 2)
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Proposition 2: Formal version of Proposition 1
  • proof
  • Remark 1