Plug-and-Play Training Framework for Preference Optimization

Jingyuan Ma, Rui Li, Zheng Li, Lei Sha, Zhifang Sui

TL;DR

The paper addresses the limitation of uniform sample weighting in Preference Optimization (PO) methods when training LLMs for high-precision tasks like mathematical reasoning. It introduces a plug-and-play weighted training framework that uses multiple sampling to estimate output distributions, derives dynamic sample weights based on observed correctness and errors, and integrates these weights into pairwise PO objectives (e.g., DPO, DPOP, IPO, SimPO). The approach centers on data collection via $N$-fold sampling, weight computation using statistics $P_c$, $P_e$, $N$, and $\epsilon$, and a Bradley–Terry–style training objective that emphasizes informative pairs. Experimental results on GSM8K and MATH show consistent improvements across multiple PO baselines and model series, with analyses linking stability and reward dynamics to the weighting strategy. The framework offers a practical, modular enhancement to RLHF-style alignment, particularly improving mathematical reasoning, while noting limitations related to defining equivalence classes for responses and suggesting future work on semantic clustering to generalize beyond clearly correct answers.
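
To make the idea of weighting a pairwise objective concrete, here is a minimal sketch that scales the standard DPO loss of each preference pair by a per-sample weight. It is only an illustration: the function names, the default `beta`, and the batch averaging are assumptions, and the weights are passed in as plain numbers because the paper's actual formula in terms of $P_c$, $P_e$, $N$, and $\epsilon$ is not reproduced in this summary.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_pair_loss(policy_chosen_logp: float,
                  policy_rejected_logp: float,
                  ref_chosen_logp: float,
                  ref_rejected_logp: float,
                  beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return -log_sigmoid(beta * margin)


def weighted_batch_loss(pairs, weights, beta: float = 0.1) -> float:
    """Mean of per-pair DPO losses, each scaled by its sample weight.

    `weights` is a hypothetical input: one non-negative number per pair,
    computed elsewhere from the sampling statistics. Averaging over the
    batch size (rather than the weight sum) is an assumption of this sketch.
    """
    if len(pairs) != len(weights):
        raise ValueError("pairs and weights must have the same length")
    total = sum(w * dpo_pair_loss(*p, beta=beta) for p, w in zip(pairs, weights))
    return total / len(pairs)


if __name__ == "__main__":
    # Two pairs with made-up log-probabilities; the second gets a larger weight.
    pairs = [(-1.0, -2.0, -1.5, -1.5), (-3.0, -2.5, -2.8, -2.8)]
    print(weighted_batch_loss(pairs, weights=[1.0, 2.0]))
```

Setting every weight to 1.0 recovers the ordinary unweighted DPO batch loss, which is what makes this kind of weighting easy to bolt onto an existing training loop.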

Abstract

Recently, preference optimization methods such as DPO have significantly enhanced large language models (LLMs) across a wide range of tasks, including dialogue and question answering. However, current methods fail to account for the varying difficulty of training samples during preference optimization, leading to mediocre performance on tasks with high accuracy requirements, particularly mathematical reasoning. To address this limitation, we propose a novel training framework that uses multiple sampling to analyze output distributions, assigns different weights to samples, and incorporates these weights into the preference optimization process. This plug-and-play approach enables LLMs to prioritize challenging examples during training, improving learning efficiency. Experimental results demonstrate that our framework integrates seamlessly with various preference optimization methods and achieves consistent improvements on mathematical reasoning tasks.

Paper Structure (25 sections, 7 equations, 9 figures, 2 tables)


Figures (9)

  • Figure 1: The figure illustrates the variability in the model's output when sampling the same question multiple times. In Question 1, the model consistently produces the correct answer across all samples. In contrast, Question 2 demonstrates a case where the model generates diverse responses, including incorrect answers, reflecting uncertainty or inconsistency in the model's reasoning process.
  • Figure 2: Overall process of the framework. In Step 1, we sample the model multiple times to collect the distribution of responses for each question. In Step 2, we identify the correct answer and the most frequently occurring incorrect answer, and weight them according to their frequency of occurrence. In Step 3, various pairwise comparison alignment methods can be applied to incorporate these weights into the training process, ultimately resulting in a trained model.
  • Figure 3: Distribution of the model's responses when the same question is sampled multiple times. Each point represents a question. The x-axis represents the number of unique answers obtained in the sampling (which can be viewed as the number of answer equivalence classes), and the y-axis represents the proportion of correct answers among all 100 responses.
  • Figure 4: Results of the model evaluated with multiple samples. Specifically, we present the results for Qwen2-7B-Instruct, where it is evident that the weighted training method generally achieves higher correctness compared to the unweighted method.
  • Figure 5: Data point distribution before and after training on Qwen2-7B-Instruct. The x-axis represents the number of unique answers obtained in the sampling (which can be viewed as the number of answer equivalence classes), and the y-axis represents the proportion of correct answers among all 100 responses. The data points shift toward the upper left, indicating that the outputs of the weighted-trained model are more stable and accurate across multiple samples.
  • ...and 4 more figures
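
The per-question quantities described in Figures 2 and 3 can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes final answers have already been extracted as strings, treats exact string equality as the answer equivalence class, and reads the correct-answer and most-frequent-incorrect-answer frequencies as the kind of statistics the weights are built from. The function name and dictionary keys are hypothetical.

```python
from collections import Counter


def sampling_stats(answers, gold):
    """Summarize the sampled final answers for one question.

    answers: list of extracted final answers (strings), one per sample.
    gold:    the reference answer, compared by exact string match
             (an assumption; real pipelines may need normalization).
    """
    n = len(answers)
    if n == 0:
        raise ValueError("need at least one sampled answer")

    counts = Counter(answers)
    # Share of samples that hit the reference answer (y-axis of Figure 3).
    p_correct = counts.get(gold, 0) / n

    # Most frequent incorrect answer and its share of the samples.
    wrong = {a: c for a, c in counts.items() if a != gold}
    if wrong:
        top_wrong, top_count = max(wrong.items(), key=lambda kv: kv[1])
        p_error = top_count / n
    else:
        top_wrong, p_error = None, 0.0

    return {
        "n": n,
        "num_unique": len(counts),  # x-axis of Figure 3
        "p_correct": p_correct,
        "top_wrong": top_wrong,
        "p_error": p_error,
    }


if __name__ == "__main__":
    # Ten made-up samples for one question whose reference answer is "72".
    samples = ["72"] * 6 + ["68"] * 3 + ["70"]
    print(sampling_stats(samples, gold="72"))
```

A question the model always answers correctly yields `num_unique == 1` and `p_correct == 1.0`, which corresponds to the upper-left region that Figure 5 describes as stable and accurate.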