Semi-Supervised Preference Optimization with Limited Feedback

Seonggyun Lee, Sungjun Lim, Seojin Park, Soeun Cheon, Kyungwoo Song

TL;DR

This paper tackles the data bottleneck in preference optimization by proposing SSPO, a semi-supervised framework that learns from a small set of labeled preferences alongside a large pool of unpaired data. It grounds pseudo-labeling in theory via a Bayes-risk-minimizing reward threshold and uses kernel density estimation to dynamically determine the threshold, coupled with an adaptive curriculum that shifts focus from labeled to pseudo-labeled data. Empirical results across toy and real-world datasets demonstrate strong data efficiency, robustness to noise, and superior performance over baselines, including domain-specific improvements. The work offers a scalable approach to aligning language models with human preferences while substantially lowering annotation costs, with broad implications for safe and reliable AI systems.
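The thresholding idea summarized above can be illustrated with a minimal sketch. It assumes a Gaussian kernel, a normal-reference rule-of-thumb bandwidth, class priors taken from sample counts, and a simple grid search; none of these choices is confirmed by the source, and the function names (`kde_cdf`, `estimate_threshold`) are hypothetical. The empirical Bayes risk used here is the prior-weighted probability that a winning reward falls at or below the threshold plus the probability that a losing reward falls above it.

```python
import math
import random


def rule_of_thumb_bandwidth(samples):
    """Normal-reference bandwidth 1.06 * std * n^(-1/5) (an assumed choice)."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in samples) / (n - 1))
    return 1.06 * std * n ** (-0.2)


def kde_cdf(samples, bandwidth):
    """CDF of a Gaussian KDE: the average of normal CDFs centred on each sample."""
    scale = bandwidth * math.sqrt(2.0)

    def cdf(x):
        total = sum(0.5 * (1.0 + math.erf((x - s) / scale)) for s in samples)
        return total / len(samples)

    return cdf


def estimate_threshold(win_rewards, lose_rewards, grid_size=200):
    """Grid-search the threshold that minimises the KDE-estimated Bayes risk."""
    cdf_w = kde_cdf(win_rewards, rule_of_thumb_bandwidth(win_rewards))
    cdf_l = kde_cdf(lose_rewards, rule_of_thumb_bandwidth(lose_rewards))
    prior_w = len(win_rewards) / (len(win_rewards) + len(lose_rewards))
    prior_l = 1.0 - prior_w

    lo = min(min(win_rewards), min(lose_rewards))
    hi = max(max(win_rewards), max(lose_rewards))
    best_delta, best_risk = lo, float("inf")
    for k in range(grid_size + 1):
        delta = lo + (hi - lo) * k / grid_size
        # Errors: a winner scored at or below delta, or a loser scored above it.
        risk = prior_w * cdf_w(delta) + prior_l * (1.0 - cdf_l(delta))
        if risk < best_risk:
            best_delta, best_risk = delta, risk
    return best_delta, best_risk


if __name__ == "__main__":
    # Synthetic rewards for illustration only; not data from the paper.
    rng = random.Random(0)
    wins = [rng.gauss(2.0, 1.0) for _ in range(200)]
    losses = [rng.gauss(-2.0, 1.0) for _ in range(200)]
    delta_hat, risk_hat = estimate_threshold(wins, losses)
    print(f"estimated threshold: {delta_hat:.3f}, estimated risk: {risk_hat:.3f}")
```

Re-running `estimate_threshold` on the current rewards at each training step is one way a threshold could be made dynamic; how often the paper refreshes it is not stated in this summary.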

Abstract

The field of preference optimization has made outstanding contributions to the alignment of language models with human preferences. Despite these advancements, recent methods still rely heavily on substantial paired (labeled) feedback data, leading to substantial resource expenditures. To address these challenges, we study the problem of Semi-Supervised Preference Optimization (SSPO) in which the idea is to learn from both a small number of pairwise preference labels and a large pool of unpaired samples simultaneously. Our key theoretical contribution proves the existence of an optimal reward threshold capable of separating winning and losing responses with high probability, which enables a principled pseudo-labeling of unpaired data. By leveraging these pseudo-labels, SSPO effectively distills latent preferences from large-scale unpaired data, thus maintaining human alignment while drastically reducing acquisition costs. Extensive experiments across datasets validate this remarkable data efficiency; for instance, SSPO trained with Mistral-7B-Instruct on just 1% of UltraFeedback consistently surpasses strong baselines trained on 10% of UltraFeedback.
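To see why a separating threshold can exist with high probability, consider a simplified worked bound. This is an illustration under stated assumptions, not the paper's theorem: suppose the losing and winning rewards are sub-Gaussian with means $\mu_l < \mu_w$ and a common variance proxy $\sigma^2$ (the symbol $\sigma$ and the midpoint choice of $\delta$ are assumptions introduced here).

```latex
\begin{align*}
\delta &= \tfrac{1}{2}(\mu_l + \mu_w), \qquad \Delta = \mu_w - \mu_l > 0, \\
\Pr\big(r_l \ge \delta\big)
  &= \Pr\big(r_l - \mu_l \ge \tfrac{\Delta}{2}\big)
  \le \exp\!\Big(-\frac{\Delta^2}{8\sigma^2}\Big), \\
\Pr\big(r_w \le \delta\big)
  &= \Pr\big(r_w - \mu_w \le -\tfrac{\Delta}{2}\big)
  \le \exp\!\Big(-\frac{\Delta^2}{8\sigma^2}\Big).
\end{align*}
```

By a union bound, a single winning and losing reward pair is then separated by $\delta$ with probability at least $1 - 2\exp(-\Delta^2 / (8\sigma^2))$, so the separation sharpens as the gap $\Delta$ grows relative to $\sigma$.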

Paper Structure

This paper contains 47 sections, 1 theorem, 31 equations, 3 figures, 18 tables, 1 algorithm.

Key Result

Theorem 1

(Existence of an Optimal Reward Threshold) Let us consider the i.i.d. samples of rewards from losing responses, $\{ r_\theta(x^{(i)}, y_l^{(i)})\}_{i=1}^{n_L}$, and from winning responses, $\{ r_\theta(x^{(j)}, y_w^{(j)})\}_{j=1}^{n_L}$. Assume both distributions are sub-Gaussian with means $\mu_l$ … for all $i,j \in \mathcal{I}(D_L)$, where $\mathcal{I}(D_L)$ denotes the index set of instances in …

Figures (3)

  • Figure 1: Overview of the SSPO framework. Existing preference optimization methods, such as DPO and SimPO, rely solely on a limited number of human-labeled comparisons. These methods discard abundant unpaired responses (e.g., supervised fine-tuning data) due to the lack of preference labels, which hinders generalization and data efficiency. SSPO leverages a reward function trained on labeled comparisons to assign pseudo-labels to unpaired responses. Responses above a learned threshold are treated as (pseudo) winning, and those below as (pseudo) losing. Hence, the policy model optimizes the reward threshold using both labeled and pseudo-labeled data, thereby improving alignment quality and generalization beyond the labeled dataset.
  • Figure 2: Loss Contribution Ratio (Mistral trained on 1% of UltraFeedback). This illustrates how the adaptive scheduler shifts the model’s learning focus from paired data (cyan) to pseudo-labeled unpaired data (red), enabling effective and robust learning.
  • Figure 3: Evolution of reward distributions and the Bayes-risk-minimizing threshold during SSPO training. We visualize the reward densities of winning (blue) and losing (orange) responses at each training step of Mistral trained on 10% of DSP Business. The dashed green line indicates the estimated threshold $\hat{\delta}$. As training progresses, the two distributions separate more clearly, and the adaptive threshold tracks the optimal decision boundary.
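The pseudo-labeling rule in Figure 1 and the scheduler behaviour in Figure 2 can be sketched together as follows. This is a minimal sketch under assumptions: the cosine decay, the start and end weights, and the names `pseudo_label`, `labeled_weight`, and `combined_loss` are hypothetical, since the captions do not specify the scheduler's functional form.

```python
import math


def pseudo_label(rewards, threshold):
    """Mark each unpaired response as pseudo-winning (1) or pseudo-losing (0)."""
    return [1 if r > threshold else 0 for r in rewards]


def labeled_weight(step, total_steps, start=0.9, end=0.1):
    """Weight on the labeled loss; decays from `start` to `end` (assumed cosine)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * progress))


def combined_loss(labeled_loss, pseudo_loss, step, total_steps):
    """Blend the two losses so focus shifts toward pseudo-labeled data over time."""
    w = labeled_weight(step, total_steps)
    return w * labeled_loss + (1.0 - w) * pseudo_loss


if __name__ == "__main__":
    # Illustrative values only; not rewards or losses from the paper.
    print(pseudo_label([0.5, -1.2, 2.0], threshold=0.0))
    for step in (0, 50, 100):
        print(step, round(labeled_weight(step, 100), 3))
```

Under this assumed schedule, the labeled share of the loss starts high and falls smoothly, which mirrors the shift from paired to pseudo-labeled data that Figure 2 describes.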

Theorems & Definitions (1)

  • Theorem 1