Right Now, Wrong Then: Non-Stationary Direct Preference Optimization under Preference Drift
Seongho Son, William Bankes, Sayak Ray Chowdhury, Brooks Paige, Ilija Bogunovic
TL;DR
This work addresses non-stationary human preferences in RLHF for large language models by introducing NS-DPO, a direct preference optimization method that uses a Dynamic Bradley-Terry model and a single discount parameter $\gamma$ to weight data by recency. The authors prove upper bounds showing how estimation error and regret depend on a drift budget $B_T$, and show that the choice $\gamma = 1 - (B_T/T)^{3/4}$ yields a drift-aware regret of $\tilde{O}(d\, B_T^{3/4}\, n^{-1/4})$, while the stationary rate $O(n^{-1/2})$ is recovered as drift vanishes. Empirically, NS-DPO is robust to several types of drift across multiple LLMs and datasets: it outperforms stationary baselines in non-stationary settings and remains competitive when no drift is present. The approach is computationally lightweight, adding only a single discount parameter to the loss, and could be extended to online and multi-time-step scenarios.
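As a worked illustration of the discount choice quoted above, the following minimal sketch evaluates $\gamma = 1 - (B_T/T)^{3/4}$ for given values of the drift budget and horizon. The function name `discount_from_drift` and the input check $0 \le B_T \le T$ are assumptions made for this example; they are not taken from the paper.

```python
def discount_from_drift(drift_budget: float, horizon: float) -> float:
    """Evaluate gamma = 1 - (B_T / T)^(3/4).

    drift_budget plays the role of B_T and horizon the role of T.
    The range check below is an assumption of this sketch, added so
    that the returned gamma always lies in [0, 1].
    """
    if horizon <= 0:
        raise ValueError("horizon must be positive")
    if not 0 <= drift_budget <= horizon:
        raise ValueError("this sketch assumes 0 <= drift_budget <= horizon")
    return 1.0 - (drift_budget / horizon) ** 0.75


if __name__ == "__main__":
    # No drift: gamma is 1, so every datapoint is weighted equally.
    print(discount_from_drift(0, 10_000))
    # More drift gives a smaller gamma, i.e. faster forgetting of old data.
    print(discount_from_drift(16, 10_000))
```

With $B_T = 0$ the discount is exactly 1, which matches the stationary case in which no reweighting is applied.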
Abstract
Current Large Language Model (LLM) preference optimization algorithms do not account for temporal preference drift, which can lead to severe misalignment. To address this limitation, we propose Non-Stationary Direct Preference Optimization (NS-DPO), which models time-dependent reward functions with a Dynamic Bradley-Terry model. NS-DPO offers a computationally efficient solution by introducing only a single discount parameter in the loss function, which is used for exponential weighting that proportionally focuses learning on more time-relevant datapoints. We theoretically analyze the convergence of NS-DPO in a general setting where the exact nature of the preference drift is not known, providing upper bounds on the estimation error and regret caused by non-stationary preferences. Finally, we demonstrate the effectiveness of NS-DPO for fine-tuning LLMs under drifting preferences. Using scenarios in which various levels of preference drift are introduced, with popular LLM reward models and datasets, we show that NS-DPO fine-tuned LLMs remain robust under non-stationarity, significantly outperforming baseline algorithms that ignore temporal preference changes, without sacrificing performance in stationary cases.
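To make the exponential weighting described in the abstract concrete, here is a minimal sketch of a recency-discounted DPO-style loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weight exponent `current_time - t`, the averaging over the number of datapoints, and the default `beta` are all choices made for this example. Each margin is assumed to be the usual DPO log-ratio difference between the preferred and rejected response, computed elsewhere.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def discounted_dpo_loss(
    margins: Sequence[float],
    timesteps: Sequence[int],
    current_time: int,
    gamma: float,
    beta: float = 0.1,  # assumed temperature, not a value from the paper
) -> float:
    """Recency-weighted DPO-style loss (illustrative sketch).

    margins[i] is assumed to be
        [log pi(y_w|x) - log pi_ref(y_w|x)] - [log pi(y_l|x) - log pi_ref(y_l|x)]
    for the i-th preference pair, collected at time timesteps[i].
    Each pair is weighted by gamma ** (current_time - timesteps[i]); this
    exponent convention is an assumption of the sketch.
    """
    if len(margins) != len(timesteps) or not margins:
        raise ValueError("margins and timesteps must be non-empty and equal length")
    total = 0.0
    for margin, t in zip(margins, timesteps):
        weight = gamma ** (current_time - t)  # older pairs get smaller weight
        total += -weight * log_sigmoid(beta * margin)
    return total / len(margins)


if __name__ == "__main__":
    margins = [2.0, -1.0, 0.5]
    timesteps = [1, 5, 10]
    print(discounted_dpo_loss(margins, timesteps, current_time=10, gamma=0.9))
```

Setting `gamma=1.0` makes every weight equal to 1, so the sketch reduces to an unweighted DPO-style average; smaller values of `gamma` shift the loss toward the most recent preference pairs.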
