DRIFT: Learning from Abundant User Dissatisfaction in Real-World Preference Learning

Yifan Wang, Bolian Li, Junlin Wu, Zhaoxuan Tan, Zheli Liu, Ruqi Zhang, Ananth Grama, Qingkai Zeng

TL;DR

This work addresses the scarcity of explicit positive feedback in real-world LLM deployment by leveraging abundant DSAT signals. DRIFT anchors training on authentic DSAT negatives and continuously samples positives from the evolving policy, optimized with a Direct Preference Optimization (DPO) loss. Empirical results on real-world WildFeedback and synthetic UltraFeedback show DRIFT surpassing SPIN and IterDPO on both WildBench and AlpacaEval2, with larger gains at the 14B scale and enhanced exploratory diversity. Theoretical analysis establishes non-vanishing training signals and guaranteed improvements in true utility, supporting DRIFT as a scalable, robust post-training recipe for real-world preference learning.

Abstract

Real-world large language model deployments (e.g., conversational AI systems, code generation assistants) naturally generate abundant implicit user dissatisfaction (DSAT) signals, as users iterate toward better answers through refinements, corrections, and expressed preferences, while explicit satisfaction (SAT) feedback is scarce. Existing preference learning approaches are poorly aligned with this data profile, as they rely on costly human annotations or assume plentiful positive responses. In this paper, we introduce DRIFT (Dissatisfaction-Refined Iterative preFerence Training), which anchors training on real-world DSAT signals and samples positives dynamically from the evolving policy. Empirically, DRIFT models trained on real-world WildFeedback datasets and synthetic UltraFeedback datasets achieve up to +6.23% (7B) / +7.61% (14B) on WildBench Task Score and up to +8.95% (7B) / +12.29% (14B) on AlpacaEval2 win rate over base models, outperforming strong baseline methods such as iterative DPO and SPIN. At larger scales, the improvements are particularly pronounced: 14B models trained with DRIFT surpass GPT-4o-mini on WildBench. Further analysis shows that DRIFT also preserves exploratory capacity, yielding more diverse high-reward solutions rather than collapsing to narrow subsets. Theoretically, we demonstrate that this design preserves preference margins and avoids gradient degeneration. These results show that DRIFT is an effective and scalable recipe for real-world post-training that leverages the most abundant and informative signal. The code and data are available at https://github.com/cacayaya/DRIFT.git.
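
As a rough illustration of the recipe described in the abstract, the sketch below pairs each logged DSAT response (used as the rejected sample) with a response freshly drawn from the current policy (used as the chosen sample) and scores the pair with the standard DPO loss. The helper names `build_pairs` and `sample_fn`, the record keys `"prompt"` and `"response"`, and the default `beta=0.1` are assumptions made for this example; they are not taken from the DRIFT code base, and the authors' actual sampling and filtering steps may differ.

```python
import math
from typing import Callable, Dict, List


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(logp_pos: float, logp_neg: float,
             ref_logp_pos: float, ref_logp_neg: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under
    the frozen reference model. beta=0.1 is an illustrative default.
    """
    # Implicit reward margin s between chosen and rejected responses.
    s = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    return -log_sigmoid(s)


def build_pairs(dsat_records: List[Dict[str, str]],
                sample_fn: Callable[[str], str]) -> List[Dict[str, str]]:
    """Pair each logged DSAT response with a fresh policy sample.

    `sample_fn` is a hypothetical stand-in for generation with the
    current policy; it is re-invoked every iteration so the positives
    track the evolving model while the negatives stay fixed.
    """
    return [
        {
            "prompt": rec["prompt"],
            "chosen": sample_fn(rec["prompt"]),   # fresh positive
            "rejected": rec["response"],          # real-world DSAT negative
        }
        for rec in dsat_records
    ]
```

In an iterative setup, `build_pairs` would be called again at the start of each round with the updated policy behind `sample_fn`, so only the chosen side of each pair changes between rounds.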

Paper Structure

This paper contains 31 sections, 4 theorems, 25 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Lemma 1

Let $E=\{\sigma(-s)\ge\tau\}$ with $\mathbb{P}(E)\ge p_0>0$ for some $\tau\in(0,\tfrac{1}{2}]$. If $\mathbb{E}\bigl[\|d_\theta\|\mid E\bigr]\ge \Delta_{\mathrm{cond}}>0$, then ...

Proof. From Eq. (eq:dpo-grad) and $\sigma(-s)\ge 0$, ...
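
To see why the weight $\sigma(-s)$ matters here, the following sketch works from the standard DPO loss and its gradient. It assumes that $s$ is the usual DPO margin and that $d_\theta$ is the difference of policy score functions for the chosen response $y^+$ and the rejected response $y^-$; the paper's own definitions and the exact constant in its bound are not visible in this excerpt, so the final inequality is an illustration under these assumptions rather than the lemma's stated conclusion.

```latex
% Assumed definitions (standard DPO form):
\mathcal{L}(\theta) = -\log \sigma(s), \qquad
s = \beta \left[ \log \frac{\pi_\theta(y^+ \mid x)}{\pi_{\mathrm{ref}}(y^+ \mid x)}
               - \log \frac{\pi_\theta(y^- \mid x)}{\pi_{\mathrm{ref}}(y^- \mid x)} \right]

% Per-pair gradient, with d_\theta assumed to be the score difference:
\nabla_\theta \mathcal{L} = -\beta\, \sigma(-s)\, d_\theta, \qquad
d_\theta = \nabla_\theta \log \pi_\theta(y^+ \mid x) - \nabla_\theta \log \pi_\theta(y^- \mid x)

% Since \sigma(-s) \ge 0, the contribution outside E can be dropped:
\mathbb{E}\,\|\nabla_\theta \mathcal{L}\|
  = \beta\, \mathbb{E}\bigl[\sigma(-s)\, \|d_\theta\|\bigr]
  \ge \beta\, \tau\, \mathbb{E}\bigl[\|d_\theta\|\, \mathbf{1}_E\bigr]
  = \beta\, \tau\, \mathbb{P}(E)\, \mathbb{E}\bigl[\|d_\theta\| \mid E\bigr]
  \ge \beta\, \tau\, p_0\, \Delta_{\mathrm{cond}}
```

Under these assumptions, the expected gradient norm stays bounded away from zero as long as a non-negligible fraction of pairs has a margin $s$ that is not yet large, which is the sense in which the training signal does not vanish.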

Figures (7)

  • Figure 1: Overview of user feedback signals and the DRIFT framework. Explicit feedback (left) is sparse and biased, as most users are passive consumers. In contrast, implicit feedback (middle) provides abundant and informative signals, where dissatisfaction (DSAT) is far more prevalent than satisfaction (SAT) (e.g., 12% vs 5% in the WildFeedback dataset). DRIFT (right) leverages these DSAT signals for preference learning, enabling our 14B model to surpass commercial models.
  • Figure 2: Comparison of high reward region coverage.
  • Figure 3: Example of response diversity and quality comparison via semantic clustering. Two central plots: Left is the UMAP scatter of all responses; Right is the reward-weighted topography showing the global high-reward region and the high-reward coverage of the three methods. DRIFT covers a substantially larger portion of the global high-reward region than SPIN or IterDPO and uniquely explores markdown formatting (yellow circle). The full prompt and responses are given in the appendix.
  • Figure 4: The top row shows DRIFT training dynamics for iteration 1 on Qwen2.5-14B-Instruct. The bottom row shows the training dynamics for iteration 2.
  • Figure 5: SPIN model response example.
  • ...and 2 more figures

Theorems & Definitions (6)

  • Lemma 1: Expected gradient lower bound under local quality
  • Theorem 1: Expected improvement of $J$
  • Theorem: Restatement of Theorem 1
  • proof
  • Proposition 1: Quantitative degeneration at a SPIN fixed point
  • proof