When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Amirabbas Afzali; Myeongho Jeon; Maria Brbic

TL;DR

The paper finds that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. It proposes Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives.

Abstract

Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
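To make the two ideas in the abstract concrete (keeping only the most confident weak-LLM labels, and re-weighting each sample's loss by confidence), here is a minimal Python sketch built on the standard DPO loss. It is an illustration under stated assumptions, not the paper's implementation: the confidence definition `|2p - 1|`, the simple mean used to aggregate weighted losses, and all function and key names (`weak_label_and_confidence`, `select_top_confident`, `cw_dpo_loss`, `logp_w`, `logp_l`) are hypothetical choices made for this example.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    """Stable log(1 + exp(z)); note that -log(sigmoid(m)) == softplus(-m)."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def weak_label_and_confidence(score_1: float, score_2: float):
    """Turn two weak-annotator scores into (preferred index, confidence).

    ASSUMPTION: p = sigmoid(score_1 - score_2) is the weak model's probability
    that y_1 is preferred, and confidence = |2p - 1| in [0, 1].
    """
    p = sigmoid(score_1 - score_2)
    preferred = 0 if p >= 0.5 else 1
    return preferred, abs(2.0 * p - 1.0)


def select_top_confident(samples, fraction: float):
    """Keep the top `fraction` of samples ranked by their 'confidence' key."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    keep = max(1, math.ceil(fraction * len(samples)))
    return sorted(samples, key=lambda s: s["confidence"], reverse=True)[:keep]


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1) -> float:
    """Standard per-sample DPO loss: -log sigmoid(beta * log-ratio margin)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return softplus(-margin)


def cw_dpo_loss(batch, beta: float = 0.1) -> float:
    """Confidence-weighted DPO loss over a batch.

    ASSUMPTION: the batch loss is the plain mean of confidence * per-sample loss.
    """
    if not batch:
        raise ValueError("batch must be non-empty")
    total = sum(
        ex["confidence"]
        * dpo_loss(ex["logp_w"], ex["logp_l"], ex["ref_logp_w"], ex["ref_logp_l"], beta)
        for ex in batch
    )
    return total / len(batch)
```

In this sketch, a sample the weak annotator is unsure about (confidence near 0) contributes almost nothing to the gradient, while hard selection with `select_top_confident` drops such samples entirely. The same weighting could in principle wrap other preference optimization losses in place of `dpo_loss`.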

Paper Structure (35 sections, 30 equations, 9 figures, 26 tables, 1 algorithm)

Introduction
Problem Statement and Preliminaries
Problem statement
Preliminaries
Confidence-Weighted Preference Optimization
Exploration on Weak LLM Confidence
Confidence-Weighted Preference Optimization
Experiments
Experimental Setup
Experimental Results
Analysis
Related Work
Concluding Remarks
The Use of Large Language Models (LLMs)
Details of Preference Optimization Loss Functions
...and 20 more sections

Figures (9)

Figure 1: Overall pipeline of our setting. Top: Conventional DPO (Rafailov et al., 2023). For each triplet consisting of a prompt $x$ and two candidate responses $(y_1, y_2)$, human annotators provide preference labels, and the policy model is aligned with these labels using DPO. Bottom: CW-DPO framework. A weak LLM is first trained as a preference annotator using a subset of human-labeled triplets. It is then applied to annotate the remaining large-scale data, which is subsequently used to train the policy model with CW-DPO. The bars at the top right report Gold Reward Accuracy for standard DPO with human-labeled data (red) and for CW-DPO (blue) on the Anthropic HH-RLHF dataset. CW-DPO uses only $30\%$ of the human annotations, whereas DPO uses the fully human-annotated dataset. OPT-125M and OPT-1.3B are used as the weak and strong models, respectively.
Figure 2: Alignment with the top-N% most confident samples. Gold reward accuracy (GRA) is reported for the trained strong models. We consider (OPT-125M $\rightarrow$ OPT-1.3B) and (Qwen-0.5B $\rightarrow$ Qwen-7B) as weak–strong model pairs. The graph shows the GRA averaged over the two strong models. Here, 100% denotes using all of the weak LLM's annotations directly, without confidence filtering. Further details of the results are provided in the appendix.
Figure 3: Left: GRA when adjusting the proportion of ${\cal D}_{\text{labeled}}$ used to fine-tune the weak LLM, while retaining 50% of the data as training data for the strong LLM. Right: GRA across varying proportions of ${\cal D}_{\text{labeled}}$. As the split ratio decreases, the size of ${\cal D}_{\text{labeled}}$ decreases and that of ${\cal D}_{\text{unlabeled}}$ increases, because the total dataset (${\cal D}_{\text{labeled}} \cup {\cal D}_{\text{unlabeled}}$) is fixed.
Figure 4: Alignment results across top-N% confidence thresholds.
Figure 5: Gold reward gap, computed as $\mathrm{R}^*_{\text{CW-DPO}} - \mathrm{R}^*_{\text{GT}}$, for responses generated by Qwen2.5-7B models optimized with CW-DPO and with standard DPO (using human-preference annotations). The Win-Rate (WR) is defined as the fraction of samples for which the CW-DPO model achieves a higher reward than the human-trained model.
...and 4 more figures
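
The gold reward gap and Win-Rate described in the Figure 5 caption can be sketched in a few lines of Python. This is only an illustration of the stated definitions: the function names and the input lists of per-sample gold rewards are hypothetical, and treating ties as non-wins is an assumption based on the caption's wording ("a higher reward").

```python
def reward_gaps(rewards_cw_dpo, rewards_human):
    """Per-sample gap R*_CW-DPO - R*_GT between the two models' gold rewards."""
    if len(rewards_cw_dpo) != len(rewards_human):
        raise ValueError("both reward lists must cover the same samples")
    return [cw - gt for cw, gt in zip(rewards_cw_dpo, rewards_human)]


def win_rate(gaps):
    """Fraction of samples where CW-DPO's reward is strictly higher.

    ASSUMPTION: ties (gap == 0) are not counted as wins.
    """
    if not gaps:
        raise ValueError("gaps must be non-empty")
    return sum(1 for g in gaps if g > 0) / len(gaps)


# Hypothetical gold rewards for four prompts (illustrative numbers only).
gaps = reward_gaps([1.0, 0.5, 2.0, 0.0], [0.5, 0.5, 1.0, 1.0])
wr = win_rate(gaps)
```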

Theorems & Definitions (2)

Definition 1: Preference Data
Definition 2: Preference Optimization Objective
