Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift

Seongho Son; William Bankes; Sayak Ray Chowdhury; Brooks Paige; Ilija Bogunovic

TL;DR

This work addresses non-stationary human preferences in RLHF for large language models by introducing NS-DPO, a direct preference optimization method that uses a Dynamic Bradley-Terry model and a single discount parameter $\gamma$ to weight data by recency. The authors prove upper bounds showing how estimation error and regret depend on a drift budget $B_T$, and show that the choice $\gamma = 1 - (B_T/T)^{3/4}$ yields a drift-aware regret of $\tilde{O}(d\, B_T^{3/4}\, n^{-1/4})$, while the stationary rate $O(n^{-1/2})$ is recovered as drift vanishes. Empirically, NS-DPO is robust to several drift types across multiple LLMs and datasets: it outperforms stationary baselines in non-stationary settings and remains competitive when drift is absent. The approach is computationally lightweight, adding only one discount parameter to the loss, and could be extended to online and multi-time-step scenarios.
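As a small illustration of the discount choice quoted above, the following sketch evaluates $\gamma = 1 - (B_T/T)^{3/4}$ for a given drift budget and horizon. The function name `discount_from_drift` and the restriction $0 \le B_T \le T$ (which keeps $\gamma$ in $[0, 1]$) are assumptions made for this example, not part of the paper.

```python
def discount_from_drift(drift_budget: float, horizon: float) -> float:
    """Return gamma = 1 - (B_T / T)^(3/4).

    Hypothetical helper: the range check below is an assumption of this
    sketch so that the returned gamma always lies in [0, 1].
    """
    if horizon <= 0:
        raise ValueError("horizon T must be positive")
    if not 0 <= drift_budget <= horizon:
        raise ValueError("this sketch assumes 0 <= B_T <= T")
    return 1.0 - (drift_budget / horizon) ** 0.75


if __name__ == "__main__":
    # No drift: gamma is 1, so every datapoint is weighted equally.
    print(discount_from_drift(0, 100))
    # Larger drift budgets give smaller gamma, i.e. faster forgetting.
    print(discount_from_drift(16, 10_000))
```

With $B_T = 0$ the discount is exactly 1, which matches the statement that the stationary behaviour is recovered as drift vanishes.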

Abstract

Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimisation (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO provides a computationally efficient solution by introducing only a single discount parameter in the loss function, which is used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios where various levels of preference drift are introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
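The exponential weighting described in the abstract can be sketched as a discounted DPO-style loss. This is a minimal illustration, not the authors' implementation: the weight form $\gamma^{T - t - 1}$, the averaging over datapoints, and the names `ns_dpo_loss`, `margins`, and `timesteps` are all assumptions of this example. Each margin stands for the difference of policy-to-reference log-ratios between the preferred and rejected response.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ns_dpo_loss(margins, timesteps, T, gamma, tau=1.0):
    """Discounted DPO-style loss (illustrative sketch).

    margins[i]   : log-ratio difference (preferred minus rejected) for pair i
    timesteps[i] : time step t_i at which pair i was collected, t_i < T
    gamma        : discount in [0, 1]; gamma = 1 gives the unweighted loss
    tau          : scaling of the margin (assumed name)
    """
    total = 0.0
    for margin, t in zip(margins, timesteps):
        weight = gamma ** (T - t - 1)  # assumed weight form: older => smaller
        total += -weight * log_sigmoid(tau * margin)
    return total / len(margins)


if __name__ == "__main__":
    margins = [0.5, -0.2, 1.0]
    timesteps = [1, 50, 100]
    print(ns_dpo_loss(margins, timesteps, T=101, gamma=1.0))   # unweighted
    print(ns_dpo_loss(margins, timesteps, T=101, gamma=0.95))  # recency-weighted
```

Setting `gamma=1.0` makes every weight equal to 1, so the sketch reduces to an ordinary averaged logistic preference loss; smaller values shrink the contribution of older pairs.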

Paper Structure (28 sections, 8 theorems, 112 equations, 10 figures, 1 table)

Introduction
Preliminaries
Learning Under Preference Drift
Theoretical Analysis of Offline Non-stationary DPO
Theoretical Results
Experiments
Experimental Setup
Experiment Results
Conclusion
Further Related Works
Analysis of NS-DPO Gradient
Further Experiment Details
Controlling the Strength of Preference Drift
Non-Stationary Preference Dataset Creation
The Two Countries (2C) Non-Stationary Global Opinions Dataset
...and 13 more sections

Key Result

Theorem 2

(Estimation error of ${\tilde{\theta}}_T$.) Let $\delta \in (0, 1]$, $\lambda > 0$, $\tau > 0$. Let ${\hat{\theta}}_T$ denote the minimiser of the NS-DPO loss used in the offline analysis. Let ${\tilde{\theta}}_T \in \Theta$ denote the parameter obtained by performing the parameter projection [...] where $C_1>0$ is a constant.

Figures (10)

Figure 1: Human preferences are dynamic and influenced by a variety of factors (e.g., environment change and societal influence). However, standard preference optimization approaches (e.g., DPO and IPO [rafailov2024direct; azar2024general]) do not account for this non-stationarity. In contrast, NS-DPO robustly learns on non-stationary data by using a Dynamic Bradley-Terry model, and adjusts the loss to discount older datapoints and concentrate learning on the latest data.
Figure 2: Experiment results on the UltraFeedback-RM dataset with preference drift. [Left] $\rho_\mathrm{diff}=0.7$. [Center Left] $\rho_\mathrm{diff}=0.9$. [Center Right] $\rho_\mathrm{diff}=0.95$. [Right] $\rho_\mathrm{diff}=1.0$. As $\rho_\mathrm{diff}$, the percentage of training datapoints with flipped preferences, increases, DPO fails to learn the preference distribution at $T=101$. Meanwhile, NS-DPO shows robust performance under various values of $\rho_\mathrm{diff}$, maintaining reward accuracies above 50%. As the change point of the reward model, $t_\mathrm{cp}$, occurs later in time, the gap between stationary approaches and NS-DPO grows. The experiments are run under a reward model shift from PairRM to ArmoRM. Llama-2-7b-chat-hf is used, and the training dataset consists of 100 time steps.
Figure 3: NS-DPO consistently outperforms DPO and IPO as the change point $t_\mathrm{cp}$ nears the present, $T=101$, for varying strengths of preference shift on the TV-HH dataset using the Llama-2-7b-chat-hf model. [Left] $\rho_\mathrm{diff}=0.7$. [Middle] $\rho_\mathrm{diff}=0.8$. [Right] $\rho_\mathrm{diff}=0.9$. As $t_\mathrm{cp}$ increases, the performance gap between NS-DPO and the baselines widens, because the closer the change point is to the present time step, the fewer samples are available from the updated preference distribution. NS-DPO discounts samples with old preferences, focusing learning on the small number of samples with up-to-date preference labels.
Figure 4: [Left | Middle] NS-DPO outperforms DPO as the change point $t_\mathrm{cp}$ nears the present time, $T=101$, for $\rho_\mathrm{diff}=0.7$ and $\rho_\mathrm{diff}=1.0$ respectively, when fine-tuning the llama-3-1b-it model on the TV-HH dataset. [Right] NS-DPO also outperforms DPO on the TV-HH dataset when preference drift is gradual across multiple time steps.
Figure 5: NS-DPO returns more aligned responses than DPO, according to the reward model at $T=101$, when a sudden preference shift occurs at later change points. We fine-tune llama-3-1b-it on the TV-HH dataset across a range of change points and values of $\rho_\mathrm{diff}$, and record the mean and standard deviation of the win rate across 600 samples from the test split over 3 runs.
...and 5 more figures
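The captions of Figures 2 to 4 explain that NS-DPO helps most when the change point $t_\mathrm{cp}$ is late, because few samples carry up-to-date labels. The sketch below makes that intuition concrete by computing how much of the total loss weight falls on samples collected at or after the change point. It assumes one datapoint per time step $t = 1, \dots, T-1$ and a weight of $\gamma^{T - t - 1}$ per datapoint; both the helper name `post_change_weight_share` and these assumptions are illustrative only.

```python
def post_change_weight_share(t_cp: int, T: int, gamma: float) -> float:
    """Fraction of total discount weight on time steps t >= t_cp.

    Assumes (for illustration) one sample per time step t = 1..T-1 and
    per-sample weight gamma ** (T - t - 1).
    """
    weights = {t: gamma ** (T - t - 1) for t in range(1, T)}
    total = sum(weights.values())
    recent = sum(w for t, w in weights.items() if t >= t_cp)
    return recent / total


if __name__ == "__main__":
    # Change point 10 steps before the present, 100 training time steps.
    for gamma in (1.0, 0.95, 0.9):
        share = post_change_weight_share(t_cp=91, T=101, gamma=gamma)
        print(f"gamma={gamma}: share of weight on post-change data = {share:.3f}")
```

Under these assumptions, an undiscounted loss (`gamma=1.0`) gives the post-change samples only their raw share of the data, while a discount below 1 shifts most of the weight onto them.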

Theorems & Definitions (9)

Remark 1
Theorem 2
Theorem 3
Corollary 4
Theorem (estimation error, offline, uniform)
Theorem (regret bound, offline, uniform)
Theorem (regret bound, offline, uniform)
Lemma (regret bound, offline, uniform)
Corollary (regret bound, offline, uniform)
