BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment

Sizhe Wang, Yongqi Tong, Hengyuan Zhang, Dawei Li, Xin Zhang, Tianlong Chen

TL;DR

This work identifies an imbalance between knowledge breadth and depth in RLHF alignment data, arising from the typical $(n,2)$ structure of instruction–response samples. It proposes Balanced Preference Optimization (BPO), a two-stage framework that first compresses breadth with embedding-based clustering and then dynamically augments depth using gradient-informed sampling, followed by DPO fine-tuning. Empirical results across HH-RLHF, SafeRLHF, MT-Bench, AlpacaEval, and UltraFeedback show that BPO achieves strong alignment performance with far less data than baselines, outperforming previous data-optimization methods while maintaining efficiency. The study also provides insights into clustering choices and depth-measurement strategies, offering practical guidelines for future preference-data optimization research.
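
The DPO fine-tuning step named above optimizes the standard DPO objective on each sampled (chosen, rejected) response pair. As a minimal sketch, assuming per-sequence log-probabilities under the policy and a frozen reference model are already available, the loss could be computed as follows. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss over a batch of preference pairs (illustrative sketch).

    Each argument is a sequence of summed token log-probabilities, one entry
    per (prompt, chosen, rejected) triple. `beta` is a hypothetical default.
    """
    pc = np.asarray(policy_chosen_logp, dtype=float)
    pr = np.asarray(policy_rejected_logp, dtype=float)
    rc = np.asarray(ref_chosen_logp, dtype=float)
    rr = np.asarray(ref_rejected_logp, dtype=float)

    # Implicit reward margin: how much more the policy prefers the chosen
    # response over the rejected one, relative to the reference model.
    margin = beta * ((pc - rc) - (pr - rr))

    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    return float(np.mean(np.logaddexp(0.0, -margin)))


if __name__ == "__main__":
    # Policy identical to the reference: margin is 0, so the loss is log(2).
    print(dpo_loss([-1.0], [-2.0], [-1.0], [-2.0]))
    # Policy assigns more mass to the chosen response: the loss decreases.
    print(dpo_loss([-0.5], [-2.0], [-1.0], [-2.0]))
```

When the policy matches the reference the margin is zero and the loss equals log 2; raising the policy's log-probability of the chosen response relative to the reference pushes the loss below that value.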

Abstract

Reinforcement Learning with Human Feedback (RLHF) is the key to the success of large language models (LLMs) in recent years. In this work, we first introduce the concepts of knowledge breadth and knowledge depth, which measure the comprehensiveness and depth of an LLM or knowledge source respectively. We reveal that the imbalance in the number of prompts and responses can lead to a potential disparity in breadth and depth learning within alignment tuning datasets by showing that even a simple uniform method for balancing the number of instructions and responses can lead to significant improvements. Building on this, we further propose Balanced Preference Optimization (BPO), designed to dynamically augment the knowledge depth of each sample. BPO is motivated by the observation that the usefulness of knowledge varies across samples, necessitating tailored learning of knowledge depth. To achieve this, we introduce gradient-based clustering, estimating the knowledge informativeness and usefulness of each augmented sample based on the model's optimization direction. Our experimental results across various benchmarks demonstrate that BPO outperforms other baseline methods in alignment tuning while maintaining training efficiency. Furthermore, we conduct a detailed analysis of each component of BPO, providing guidelines for future research in preference data optimization.

Paper Structure

This paper contains 43 sections, 13 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Overview of knowledge breadth and depth, and how we link them with the number of instructions and responses in alignment tuning datasets.
  • Figure 2: Preliminary experiment results on SafeRLHF using simple balance.
  • Figure 3: Overview of BPO pipeline. BPO first selects representative prompts to reduce knowledge breadth through embedding-based clustering. Next, it generates responses using the SFT policy and employs GPT-4 to score these responses to uniformly construct response pairs. Subsequently, BPO samples response pairs to dynamically augment knowledge depth through gradient-based clustering. Finally, DPO is applied to the sampled data, which ensures efficient alignment.
  • Figure 4: Dynamic allocation of response pairs based on gradient clustering. Different prompts require varying numbers of pairs. While most prompts can be adequately represented with a single response pair, certain prompts demand a more comprehensive exploration involving additional pairs.
  • Figure 5: Experimental results for varying numbers of clusters during embedding-based and gradient-based K-means. The top $\eta = 10\%$ of data points are selected based on Equation \ref{rank} for all clustering tasks. All experiments are conducted on $\text{Llama-3}_\text{8B}$.
  • ...and 1 more figure
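
The first stage described in the Figure 3 caption, selecting representative prompts through embedding-based clustering, can be illustrated with a small K-means example. This is a minimal sketch under stated assumptions: prompt embeddings are taken as given, and one prompt per cluster is kept by choosing the point nearest the centroid. The paper's actual ranking rule (the equation referenced in the Figure 5 caption) is not reproduced here, and the names `kmeans` and `select_representative_prompts` are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Initialise centroids from k distinct data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        updated = centroids.copy()
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                updated[j] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Recompute labels so they always match the returned centroids.
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centroids


def select_representative_prompts(embeddings, k, seed=0):
    """Return sorted indices of one prompt per non-empty cluster.

    Assumption: the representative is the prompt closest to its centroid.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels, centroids = kmeans(embeddings, k, seed=seed)
    chosen = []
    for j in range(k):
        idx = np.flatnonzero(labels == j)
        if idx.size == 0:
            continue
        dist = ((embeddings[idx] - centroids[j]) ** 2).sum(axis=1)
        chosen.append(int(idx[dist.argmin()]))
    return sorted(chosen)


if __name__ == "__main__":
    # Toy 2-D "embeddings": two well-separated groups of three prompts each.
    toy = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
           [10.0, 10.0], [10.2, 9.9], [9.8, 10.3]]
    print(select_representative_prompts(toy, k=2))
```

On the toy data the function keeps one prompt from each group, shrinking six prompts to two. In the pipeline described above, the freed budget would then go to additional response pairs for the prompts that need them.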