Alignment through Meta-Weighted Online Sampling: Bridging the Gap between Data Generation and Preference Optimization

Junming Yang, Ning Xu, Biao Liu, Shiqi Qiao, Xin Geng

TL;DR

Meta-Weighted Adaptive Preference Optimization (MetaAPO) is a framework that dynamically couples data generation with model training. It employs a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefits of on-policy sampling relative to offline data.

Abstract

Preference optimization is crucial for aligning large language models (LLMs) with human values and intentions. A significant challenge in this process is the distribution mismatch between pre-collected offline preference data and the evolving model policy. Existing methods attempt to reduce this gap using static heuristics or decoupled online sampling strategies, but they often fail to adapt to the model's dynamic learning state. To bridge this gap, we propose Meta-Weighted Adaptive Preference Optimization (MetaAPO), a novel framework that dynamically couples data generation with model training. MetaAPO employs a lightweight meta-learner as an "alignment gap estimator" to evaluate the potential benefits of on-policy sampling relative to offline data. The estimator guides targeted online generation and assigns sample-wise meta-weights to the optimization objective, dynamically balancing the quality and distribution of online and offline data. Experiments on AlpacaEval 2, Arena-Hard and MT-Bench demonstrate that MetaAPO consistently outperforms existing preference optimization approaches across various settings, while reducing online annotation costs by 42%. Code is available at https://github.com/junming-yang/MetaAPO.
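The coupling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a DPO-style implicit reward margin, a one-dimensional logistic meta-learner, a sampling rule in which a lower weight makes online generation more likely, and a convex combination of offline and online losses. All function names and the parameters `a`, `b`, and `beta` are hypothetical.

```python
import math
import random


def implicit_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style implicit reward margin between a chosen (w) and rejected (l) response."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def dpo_loss(margin):
    """Per-pair DPO loss, -log(sigmoid(margin)), computed in a numerically stable way."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def meta_weight(offline_margin, a=1.0, b=0.0):
    """Hypothetical meta-learner: a logistic map from the offline margin to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(a * offline_margin + b)))


def should_sample_online(weight, rng):
    """Assumed rule: the lower the weight, the more likely on-policy sampling is requested."""
    return rng.random() > weight


def weighted_loss(weight, offline_margin, online_margin=None):
    """Assumed objective: meta-weighted mix of the offline and (if present) online pair losses."""
    loss = weight * dpo_loss(offline_margin)
    if online_margin is not None:
        loss += (1.0 - weight) * dpo_loss(online_margin)
    return loss


if __name__ == "__main__":
    rng = random.Random(0)
    m_off = implicit_margin(-1.0, -2.0, -1.5, -1.5)
    w = meta_weight(m_off)
    if should_sample_online(w, rng):
        # Stand-in value; in practice this margin would come from annotated on-policy responses.
        m_on = 0.3
        print(weighted_loss(w, m_off, m_on))
    else:
        print(weighted_loss(w, m_off))
```

In this sketch the same weight drives both decisions: whether to spend annotation budget on a prompt and how much the offline pair counts in the loss. That mirrors the coupling the abstract describes, under the assumptions stated above.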

Paper Structure

This paper contains 37 sections, 1 theorem, 20 equations, 10 figures, 8 tables, 1 algorithm.

Key Result

Theorem 1

Let $\hat{h}_{\boldsymbol{\phi}}$ denote the meta-learner function learned by minimizing the empirical meta-risk over the meta-buffer $\mathcal{B}_\text{meta}$ of size $m$, and let $h^*$ be the oracle function that minimizes the true meta-risk over the hypothesis space $\mathcal{H}$. Assume the meta where $R(h_{\boldsymbol{\phi}})$ and $\widehat{R}_m(h_{\boldsymbol{\phi}})$ denote the true and emp

Figures (10)

  • Figure 1: Left: Overview of MetaAPO. MetaAPO employs a meta-learner to couple online data generation (top) and model training (bottom). The meta-learner adaptively assigns weights by evaluating offline data, guiding both targeted online sample generation and training on the weighted combination of offline and online samples. Right: Performance (left y-axis, line plots) and online generation and annotation ratio relative to Online DPO (right y-axis, bar plots) across training iterations for different methods.
  • Figure 2: Left: The dynamic changes in Offline Score (left y-axis) and Reward (right y-axis) across training iterations. Middle: Scatter plot of offline implicit reward margin ($m_\text{off}(\cdot)$) versus the online-offline implicit reward margin gap ($m_\text{on}(\cdot) - m_\text{off}(\cdot)$). Points are colored by their sampling status: blue for "Sampled" (selected for online generation) and orange for "Unsampled". Right: Comparison of independent reward score distributions (via kernel density estimation) for testset responses generated by DPO and MetaAPO.
  • Figure 3: Left: Impact of the meta-learner update interval ($T_\text{meta}$) on MetaAPO performance. Performance (AlpacaEval 2 WR/LC (%)) is shown for different $T_\text{meta}$ values. Middle: Meta-learner input-output relationship across training iterations. Right: Comparison of the distributions of the offline implicit reward margin $m^\text{off}(\cdot)$ and the online implicit reward margin $m^\text{on}(\cdot)$ (via histogram density estimation).
  • Figure 4: Left: Impact of rollout generation size ($K$) on MetaAPO performance. Right: Human evaluation results for model response win rates.
  • Figure 5: A concise alignment example.
  • ...and 5 more figures

Theorems & Definitions (4)

  • Theorem 1: Generalization Bound for Meta-Learner
  • Definition 1: Meta-Loss Function
  • Definition 2: True Risk and Empirical Risk
  • Definition 3: Rademacher Complexity
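
For orientation, the quantities named in Definitions 2 and 3 have standard textbook forms, given below. They are an assumption about what those definitions formalize, and the paper's exact notation and loss may differ. The symbols $\ell$ (a loss), $z_i$ (the $m$ samples in the meta-buffer), $\mathcal{D}$ (the data distribution), and $\sigma_i$ (independent uniform $\pm 1$ Rademacher variables) are introduced here for illustration only.

$$
R(h) = \mathbb{E}_{z \sim \mathcal{D}}\big[\ell(h, z)\big],
\qquad
\widehat{R}_m(h) = \frac{1}{m}\sum_{i=1}^{m} \ell(h, z_i),
$$

$$
\widehat{\mathfrak{R}}_m(\mathcal{H}) = \mathbb{E}_{\sigma}\left[\sup_{h \in \mathcal{H}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, \ell(h, z_i)\right].
$$

Generalization bounds of the kind named in Theorem 1 typically control the gap between $R(h)$ and $\widehat{R}_m(h)$ uniformly over $\mathcal{H}$ using a complexity term of this form plus a term that shrinks as $m$ grows.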