
Online DPO: Online Direct Preference Optimization with Fast-Slow Chasing

Biqing Qi, Pengfei Li, Fangyuan Li, Junqi Gao, Kaiyan Zhang, Bowen Zhou

TL;DR

This work tackles online human-preference alignment for large language models, addressing catastrophic forgetting in continual and cross-domain settings. It introduces Online Fast-Slow Chasing DPO (OFS-DPO), which trains two LoRA-based modules at different update speeds and coordinates them with a regularization term, and a cross-domain extension, COFS-DPO, which linearly combines the fast modules learned in different task domains to preserve prior knowledge. The authors derive a regret-based theoretical motivation and report that OFS-DPO outperforms DPO in in-domain alignment, while COFS-DPO excels in cross-domain continual learning. The methods are designed to be memory-efficient and readily applicable to streaming preference data, with practical impact for continual alignment across diverse tasks.

Abstract

Direct Preference Optimization (DPO) improves the alignment of large language models (LLMs) with human values by training directly on human preference datasets, eliminating the need for reward models. However, due to the presence of cross-domain human preferences, direct continual training can lead to catastrophic forgetting, limiting DPO's performance and efficiency. Inspired by intraspecific competition driving species evolution, we propose Online Fast-Slow chasing DPO (OFS-DPO) for preference alignment, simulating competition through fast and slow chasing among models to facilitate rapid adaptation. Specifically, we first derive the regret upper bound for online learning, validating our motivation with a min-max optimization pattern. Based on this, we introduce two identical modules using Low-Rank Adaptation (LoRA) with different optimization speeds to simulate intraspecific competition, and propose a new regularization term to guide their learning. To further mitigate catastrophic forgetting in cross-domain scenarios, we extend OFS-DPO with a LoRA module combination strategy, resulting in Cross-domain Online Fast-Slow chasing DPO (COFS-DPO). This method leverages linear combinations of fast-module parameters from different task domains, fully utilizing historical information to achieve continual value alignment. Experimental results show that OFS-DPO outperforms DPO in in-domain alignment, while COFS-DPO excels in cross-domain continual learning scenarios.
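
As a point of reference for the DPO objective that the abstract builds on, here is a minimal sketch of the standard per-example DPO loss, written in plain Python. The function and argument names are illustrative assumptions. `dpo_beta` is the usual DPO temperature and is unrelated to the combination weights $(\beta_1,\beta_2)$ in Figure 2. The fast-slow regularization term of OFS-DPO is not included, since its form is not given in this summary.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    dpo_beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response
    under the trainable policy or the frozen reference model.
    """
    # Implicit reward of each response: log-ratio of policy to reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss pushes the chosen log-ratio above the rejected one.
    margin = dpo_beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference model the margin is zero and the loss is $\log 2$; raising the policy's log-probability of the chosen response lowers it.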

Paper Structure

This paper contains 28 sections, 7 theorems, 59 equations, 3 figures, 7 tables, 2 algorithms.

Key Result

Lemma 3.2.1

For online learning methods, there exists a regret upper bound that includes a minimax term. The first term of the bound is the regret against the best $h' \in \mathcal{H}'$, where $\mathcal{H}'$ is an infinite hypothesis class used to approximate $\mathcal{H}$; the second term captures how well $\mathcal{H}'$ approximates $\mathcal{H}$.
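
For orientation, the two terms described in the lemma mirror a standard decomposition of regret. The sketch below uses assumed notation that does not come from the text above: $\ell_t$ for the round-$t$ loss, $h_t$ for the learner's hypothesis at round $t$, and $T$ for the number of rounds. It is an exact identity (when the infima are finite) rather than the lemma's bound itself.

$$
\underbrace{\sum_{t=1}^{T}\ell_t(h_t)-\inf_{h\in\mathcal{H}}\sum_{t=1}^{T}\ell_t(h)}_{\text{regret against }\mathcal{H}}
=\underbrace{\sum_{t=1}^{T}\ell_t(h_t)-\inf_{h'\in\mathcal{H}'}\sum_{t=1}^{T}\ell_t(h')}_{\text{regret against the best }h'\in\mathcal{H}'}
+\underbrace{\inf_{h'\in\mathcal{H}'}\sum_{t=1}^{T}\ell_t(h')-\inf_{h\in\mathcal{H}}\sum_{t=1}^{T}\ell_t(h)}_{\text{how well }\mathcal{H}'\text{ approximates }\mathcal{H}}
$$

If $\mathcal{H}' \subseteq \mathcal{H}$, the last term is nonnegative, so a richer $\mathcal{H}'$ shrinks the approximation gap while typically making the first term harder to control.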

Figures (3)

  • Figure 1: The framework of the OFS-DPO. In the upper section, F-Module and S-Module dynamically adjust during training, while the reference model remains fixed. The lower section illustrates the framework of the original DPO.
  • Figure 2: The framework of the COFS-DPO. The fast-slow modules are instantiated with LoRA separately in each task domain to obtain the optimal LoRA module for that domain. The optimal linear combination $(\beta_1,\beta_2)$ across all task domains is then sought.
  • Figure 3: All ablation results are based on IMDB. From left to right: Win rates with different choices of the regularization coefficient $\alpha$; win rates comparing OFS-DPO and PPO under varying learning rate multipliers between fast-slow modules; the influence of batch size and the contrast period $k$ between fast-slow modules on win rates; and kernel density estimates of the loss gradients from the original DPO and the OFS-DPO during the training process.
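
The linear combination in the Figure 2 caption can be illustrated with a minimal sketch, assuming each domain's LoRA module is stored as a mapping from parameter name to a flat list of floats. The function name `combine_lora_params`, the parameter names, and the choice to combine raw parameters entry by entry are illustrative assumptions. How the optimal weights $(\beta_1,\beta_2)$ are searched for is not described here, so the weights are simply passed in.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def combine_lora_params(modules: List[Params], weights: List[float]) -> Params:
    """Return sum_i weights[i] * modules[i], parameter by parameter."""
    if not modules:
        raise ValueError("at least one module is required")
    if len(modules) != len(weights):
        raise ValueError("need exactly one weight per module")
    names = modules[0].keys()
    for module in modules:
        if module.keys() != names:
            raise ValueError("all modules must share the same parameter names")

    combined: Params = {}
    for name in names:
        size = len(modules[0][name])
        if any(len(module[name]) != size for module in modules):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Weighted sum of the corresponding entries across domains.
        combined[name] = [
            sum(w * module[name][i] for w, module in zip(weights, modules))
            for i in range(size)
        ]
    return combined


# Hypothetical fast modules from two task domains.
domain_a = {"lora_A": [1.0, 2.0], "lora_B": [0.0, 4.0]}
domain_b = {"lora_A": [3.0, 0.0], "lora_B": [2.0, 2.0]}
merged = combine_lora_params([domain_a, domain_b], [0.25, 0.75])
```

In practice the weights would be chosen by evaluating the merged module on data from every domain, for example with a small grid search; that selection step is left out of this sketch.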

Theorems & Definitions (14)

  • Definition 3.2.1
  • Lemma 3.2.1
  • Definition 3.3.1
  • Theorem 3.3.1
  • Proposition 3.3.1
  • Definition 3.4.1
  • Theorem 3.4.1
  • proof
  • Lemma A.1.1
  • proof
  • ...and 4 more