Risk-aware Direct Preference Optimization under Nested Risk Measure

Lijun Zhang, Lin Li, Yajie Qi, Huizhong Song, Yaodong Yang, Jun Wang, Wei Wei

TL;DR

This work tackles the challenge of risk control in aligning autoregressive language models with human preferences by introducing Risk-aware Direct Preference Optimization (Ra-DPO). Ra-DPO integrates nested risk measures into a token-level Pb-MDP framework, derives a risk-sensitive Bradley-Terry formulation, and establishes its equivalence to a Regret Preference Model to yield a tractable optimization objective. The approach combines a risk-aware advantage with a KL constraint, producing two loss functions, Ra-DPO_1 and a stabilizing variant Ra-DPO_2, that balance alignment performance and model drift. Empirical results on IMDb, Anthropic HH, and AlpacaEval show that Ra-DPO achieves competitive reward quality while significantly reducing sequential drift, illustrating practical benefits for reliable, human-aligned language generation. The work provides an open-source implementation and lays groundwork for further integrating risk-awareness into safe, scalable LLM alignment.

Abstract

When fine-tuning pre-trained Large Language Models (LLMs) to align with human values and intentions, maximizing the estimated reward can lead to superior performance, but it also introduces potential risks due to deviations from the reference model's intended behavior. Most existing methods introduce a KL divergence term to constrain deviations between the trained model and the reference model; however, this may not be sufficient in applications that require tight risk control. In this paper, we introduce Risk-aware Direct Preference Optimization (Ra-DPO), a novel approach that incorporates risk-awareness by employing a class of nested risk measures. This approach formulates a constrained risk-aware advantage function maximization problem and then converts the Bradley-Terry model into a token-level representation. The objective function maximizes the likelihood of the policy while suppressing the deviation between the trained model and the reference model using a sequential risk ratio, thereby enhancing the model's risk-awareness. Experimental results on three open-source datasets (IMDb, Anthropic HH, and AlpacaEval) demonstrate the proposed method's superior performance in balancing alignment performance and model drift. Our code is open-sourced at https://github.com/zlj123-max/Ra-DPO.

Paper Structure

This paper contains 37 sections, 5 theorems, 59 equations, 14 figures, 3 tables, 1 algorithm.

Key Result

Lemma 3.1

For a given Pb-MDP, suppose the reward over the entire prompt-response can be decomposed as $r = \sum_{t=1}^T \gamma^{t-1} R\left(\left[x, y^{<t}\right], y^t\right)$. Then the relationship between the state value functions in Equation (Nested PbRL MDP) and Equation (New PbRL MDP) is as follows: ...
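
The reward decomposition stated in Lemma 3.1 can be illustrated with a minimal Python sketch. It only evaluates the discounted sum $\sum_{t=1}^T \gamma^{t-1} R([x, y^{<t}], y^t)$ for a list of per-token rewards; the function name `discounted_response_reward` and the example reward values are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def discounted_response_reward(token_rewards: Sequence[float], gamma: float = 1.0) -> float:
    """Sum per-token rewards with discount gamma^(t-1) for t = 1..T.

    token_rewards[t-1] stands in for R([x, y^{<t}], y^t); how those
    values are obtained is outside the scope of this sketch.
    """
    total = 0.0
    for t, reward in enumerate(token_rewards, start=1):
        total += gamma ** (t - 1) * reward
    return total


# Hypothetical per-token rewards for a three-token response.
example_rewards = [1.0, 0.5, 0.25]
example_total = discounted_response_reward(example_rewards, gamma=0.5)
```

With `gamma=1.0` the function reduces to a plain sum of the token-level rewards, which is the undiscounted special case of the same decomposition.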

Figures (14)

  • Figure 1: Comparison of loss functions for DPO, $\text{TDPO}_\text{2}$ and $\text{Ra-DPO}_\text{2}$ methods. The $\operatorname{sg}$ denotes the stop-gradient operator.
  • Figure 2: The experiment on the IMDb dataset with GPT-2 Large serving as the base model. (a) and (b) present the progression of sequential KL divergence (the lower the better) for both preferred and dispreferred responses. (c) illustrates the reward accuracy curves (the higher the better).
  • Figure 3: The experiment on the Anthropic HH dataset with Pythia-1.4B serving as the base model. Left and Middle present the progression of sequential KL divergence (the lower the better) for both preferred and dispreferred responses. Right illustrates reward accuracy curves (the higher the better).
  • Figure 4: The experiment on the Anthropic HH dataset with Pythia-1.4B serving as the base model. Left and Middle present the sequential KL divergence (the lower the better) for preferred and dispreferred responses, while Right presents the reward accuracy curves (the higher the better) under $\alpha \in \{0.3, 0.5, 0.7, 0.9\}$.
  • Figure 5: The experiment on the Anthropic HH dataset with Pythia-1.4B serving as the base model. Left and Middle present the progression of sequential KL divergence (the lower the better) for both preferred and dispreferred responses. Right illustrates reward accuracy curves (the higher the better).
  • ...and 9 more figures
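
The captions above report sequential KL divergence without defining it. As a rough sketch, and assuming it means the sum over token positions of the KL divergence from the reference model's next-token distribution to the trained policy's (the direction and the summation are assumptions here, not confirmed by this page), it could be computed as follows. The function names and the toy distributions are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    # Terms with p_i == 0 contribute nothing; q_i is floored at eps to avoid log(0).
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def sequential_kl(
    ref_dists: Sequence[Sequence[float]],
    policy_dists: Sequence[Sequence[float]],
) -> float:
    """Sum of per-position KL(ref || policy) along one response (assumed definition)."""
    if len(ref_dists) != len(policy_dists):
        raise ValueError("both models must provide one distribution per token position")
    return sum(kl_divergence(p, q) for p, q in zip(ref_dists, policy_dists))


# Toy two-token response over a two-word vocabulary.
ref = [[0.5, 0.5], [0.5, 0.5]]
policy = [[0.5, 0.5], [0.25, 0.75]]
drift = sequential_kl(ref, policy)
```

Under this assumed definition, a policy identical to the reference model gives a value of zero, and larger values indicate more drift along the response.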

Theorems & Definitions (11)

  • Lemma 3.1
  • Definition 3.2
  • Lemma 3.3
  • Lemma 3.4
  • Lemma 3.5
  • Theorem 3.6
  • proof
  • proof
  • proof
  • proof
  • ...and 1 more