Would I Lie To You? Inference Time Alignment of Language Models using Direct Preference Heads

Avelina Asada Hadji-Kyriacou; Ognjen Arandjelovic

TL;DR

This work introduces Direct Preference Heads (DPH), an inference-time, auxiliary reward mechanism that guides candidate selection without altering a language model's logits. By defining Separable and Contrastive DPH losses and linking them to Conservative Direct Preference Optimization, the approach achieves robust learning of human-preference signals with reduced risk to reasoning, particularly for smaller models. Comprehensive evaluation across GLUE, GPT4All-style commonsense benchmarks, and RACE demonstrates that DPH can outperform supervised fine-tuning and Direct Preference Optimization on multiple tasks, while maintaining stability and efficiency through a three-stage training pipeline and priors regularization. The work discusses limitations and safety considerations, and outlines future work to extract richer signal categories (e.g., helpfulness, toxicity) and reinforce safety through additional guardrails.

Abstract

Pre-trained Language Models (LMs) exhibit strong zero-shot and in-context learning capabilities; however, their behaviors are often difficult to control. By utilizing Reinforcement Learning from Human Feedback (RLHF), it is possible to fine-tune unsupervised LMs to follow instructions and produce outputs that reflect human preferences. Despite its benefits, RLHF has been shown to potentially harm a language model's reasoning capabilities and introduce artifacts such as hallucinations where the model may fabricate facts. To address this issue we introduce Direct Preference Heads (DPH), a fine-tuning framework that enables LMs to learn human preference signals through an auxiliary reward head without directly affecting the output distribution of the language modeling head. We perform a theoretical analysis of our objective function and find strong ties to Conservative Direct Preference Optimization (cDPO). Finally we evaluate our models on GLUE, RACE, and the GPT4All evaluation suite and demonstrate that our method produces models which achieve higher scores than those fine-tuned with Supervised Fine-Tuning (SFT) or Direct Preference Optimization (DPO) alone.
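The idea of scoring outputs with an auxiliary head, rather than changing the language modeling head's distribution, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the linear form of `reward_head`, the helper `select_candidate`, and the toy hidden states and weights are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def reward_head(hidden: Sequence[float], weights: Sequence[float], bias: float = 0.0) -> float:
    """Hypothetical linear reward head: maps one hidden-state vector to a scalar reward."""
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def select_candidate(
    candidates: Sequence[str],
    hidden_states: Sequence[Sequence[float]],
    weights: Sequence[float],
    bias: float = 0.0,
) -> Tuple[str, List[float]]:
    """Score each already-generated candidate and return the highest-reward one.

    The candidates are produced by the language modeling head as usual; the
    reward is only used afterwards to choose between them.
    """
    rewards = [reward_head(h, weights, bias) for h in hidden_states]
    best = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best], rewards


# Toy values, made up for illustration only.
candidates = ["answer A", "answer B", "answer C"]
hidden_states = [[0.2, -1.0], [1.5, 0.3], [-0.4, 0.9]]
weights = [1.0, 0.5]

chosen, rewards = select_candidate(candidates, hidden_states, weights)
print(chosen, rewards)
```

In this sketch the reward only ranks finished candidates, which is one way to read "without directly affecting the output distribution of the language modeling head".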

Paper Structure

This paper contains 45 sections, 2 theorems, 12 equations, 1 figure, 6 tables, and 1 algorithm.

Introduction
Prior Approaches
Reinforcement Learning from Human Feedback (RLHF)
Direct Preference Optimization (DPO)
Direct Preference Heads
Reward Head
Objective Function
Separable DPH
Contrastive DPH
Relation to cDPO
Novelty over Traditional Reward Modelling
Experimental Setup and Data
Datasets
Prompts and Sampling
Regularization
...and 30 more sections

Key Result

Theorem 1

For all $\epsilon \in (0,0.5]$ the objective function $\mathcal{L}_\text{SepDPH}$ is convex and will optimize the policy $\pi_\theta$ such that the preferred rewards $r_w$ produced by the preference head converge towards $\log\tfrac{1-\epsilon}{\epsilon}$ and the dispreferred rewards $r_l$ converge
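The convergence target for the preferred reward can be checked numerically under an assumed loss form. The definition of $\mathcal{L}_\text{SepDPH}$ is not given in this excerpt, so the sketch below assumes a label-smoothed binary cross-entropy term on a single reward, $-(1-\epsilon)\log\sigma(r) - \epsilon\log\sigma(-r)$; this form, and the names `smoothed_loss` and `minimise`, are hypothetical. Under that assumption the gradient is $\sigma(r) - (1-\epsilon)$, which vanishes at $r = \log\tfrac{1-\epsilon}{\epsilon}$. Only the preferred-reward target is checked, because the statement for $r_l$ is cut off above.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def smoothed_loss(r: float, eps: float) -> float:
    """Assumed per-reward loss: label-smoothed binary cross-entropy on reward r."""
    return -(1.0 - eps) * math.log(sigmoid(r)) - eps * math.log(sigmoid(-r))


def minimise(eps: float, lr: float = 0.5, steps: int = 2000) -> float:
    """Plain gradient descent on the assumed loss, starting from r = 0."""
    r = 0.0
    for _ in range(steps):
        grad = sigmoid(r) - (1.0 - eps)  # derivative of smoothed_loss w.r.t. r
        r -= lr * grad
    return r


for eps in (0.1, 0.25, 0.5):
    r_star = minimise(eps)
    target = math.log((1.0 - eps) / eps)
    print(f"eps={eps}: minimiser={r_star:.6f}, log((1-eps)/eps)={target:.6f}")
```

The second derivative of the assumed term is $\sigma(r)(1-\sigma(r)) > 0$, so it is convex and has a single minimiser, consistent with the convexity claim in the theorem.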

Figures (1)

Figure 1: The loss landscapes of the DPH loss functions. The red and green points represent the rewards assigned to preferred and dispreferred answers, the vertical lines represent the direction and magnitude of reward gradients, and the blue area represents the optimal margin parameterised by $\epsilon$.

Theorems & Definitions (2)

Theorem 1
Theorem 2
