Risk-aware Direct Preference Optimization under Nested Risk Measure

Lijun Zhang; Lin Li; Yajie Qi; Huizhong Song; Yaodong Yang; Jun Wang; Wei Wei

TL;DR

This work tackles the challenge of risk control when aligning autoregressive language models with human preferences by introducing Risk-aware Direct Preference Optimization (Ra-DPO). Ra-DPO integrates nested risk measures into a token-level Pb-MDP framework, derives a risk-sensitive Bradley-Terry formulation, and establishes its equivalence to a Regret Preference Model, which yields a tractable optimization objective. The approach combines a risk-aware advantage with a KL constraint, producing two loss functions, Ra-DPO_1 and a stabilizing variant Ra-DPO_2, that balance alignment performance against model drift. Empirical results on IMDb, Anthropic HH, and AlpacaEval show that Ra-DPO achieves competitive reward quality while significantly reducing sequential drift, illustrating practical benefits for reliable, human-aligned language generation. The work provides an open-source implementation and lays groundwork for further integrating risk-awareness into safe, scalable LLM alignment.

Abstract

When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce a KL divergence term to constrain deviations between the trained model and the reference model; however, this may not be sufficient in applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy while suppressing the deviation between the trained model and the reference model using a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results on three open-source datasets (IMDb, Anthropic HH, and AlpacaEval) demonstrate the proposed method's superior performance in balancing alignment performance and model drift. Our code is open-sourced at https://github.com/zlj123-max/Ra-DPO.
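For orientation, the sketch below shows the standard sequence-level DPO loss under the Bradley-Terry model, which is the baseline that Ra-DPO builds on. It is not the Ra-DPO objective: the risk-aware advantage and the sequential risk ratio described above are not reproduced here. The function names (`log_sigmoid`, `dpo_loss`), the argument layout (lists of per-token log-probabilities), and the default `beta=0.1` are assumptions made for illustration only.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Baseline DPO loss for a single preference pair (illustrative only).

    Each argument is a list of per-token log-probabilities:
      *_w for the preferred response, *_l for the dispreferred one.
    `beta` scales the implicit KL constraint toward the reference model.
    """
    # Sequence-level log-ratio between policy and reference for each response.
    log_ratio_w = sum(policy_logp_w) - sum(ref_logp_w)
    log_ratio_l = sum(policy_logp_l) - sum(ref_logp_l)
    # Bradley-Terry: negative log-probability that the preferred response wins.
    margin = beta * (log_ratio_w - log_ratio_l)
    return -log_sigmoid(margin)


# Hypothetical per-token log-probabilities for a two-token response pair.
ref_w, ref_l = [-1.0, -2.0], [-1.5, -1.5]
loss_at_reference = dpo_loss(ref_w, ref_l, ref_w, ref_l)
loss_improved = dpo_loss([-0.5, -1.0], [-2.0, -2.5], ref_w, ref_l)
```

When the policy equals the reference model, the margin is zero and the loss equals log 2; raising the policy's likelihood of the preferred response relative to the dispreferred one lowers it. According to the abstract, Ra-DPO changes this picture by working at the token level and by penalizing deviation from the reference model through a sequential risk ratio.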
