Aligning Language Models from User Interactions

Thomas Kleine Buening, Jonas Hübotter, Barna Pásztor, Idan Shenfeld, Giorgia Ramponi, Andreas Krause

Abstract

Multi-turn user interactions are among the most abundant data produced by language models, yet we lack effective methods to learn from them. While typically discarded, these interactions often contain useful information: follow-up user messages may indicate that a response was incorrect, failed to follow an instruction, or did not align with the user's preferences. Importantly, language models are already able to make use of this information in context. After observing a user's follow-up, the same model is often able to revise its behavior. We leverage this ability to propose a principled and scalable method for learning directly from user interactions through self-distillation. By conditioning the model on the user's follow-up message and comparing the resulting token distribution with the original policy, we obtain a target for updating the policy that captures how the model's behavior changes in hindsight. We then distill this hindsight distribution back into the current policy. Remarkably, we show that training on real-world user conversations from WildChat improves language models across standard alignment and instruction-following benchmarks, without regressing other capabilities. The same mechanism enables personalization, allowing models to continually adapt to individual users through interaction without explicit feedback. Our results demonstrate that raw user interactions that arise naturally during deployment enable alignment, personalization, and continual adaptation.

Paper Structure

This paper contains 31 sections, 2 theorems, 16 equations, 9 figures, 6 tables, 1 algorithm.

Key Result

Proposition 1

Under idealized assumptions on user responses and model conditioning, the sequence-level self-distillation advantage satisfies an identity involving a normalization term $Z(x,y)$. In other words, under these assumptions, SDPO can be interpreted as implicitly maximizing the interacting user’s latent reward function.

Figures (9)

  • Figure 1: Direct Learning from User Interactions via Self-Distillation. From multi-turn user conversations, we obtain several interactions $(x, y, o)$ that consist of the conversation history $x$, the model's response $y$, and the subsequent user message $o$. By conditioning on the user's follow-up, we form the hindsight policy and compare it to the original policy, producing token-level advantages that reinforce or penalize parts of the model's original response. In this example, the user's follow-up requests a more direct answer, leading to penalizing filler tokens and reinforcing the answer.
  • Figure 2: Example of the token-level advantages (\ref{eq:logratio}), where the user complains with $o$ = "I said YES or NO only" after the assistant failed to follow the instruction.
  • Figure 3: Training on real-world user conversations, SDPO improves general alignment and instruction-following performance across benchmarks, without degrading other capabilities. Results for Qwen3-8B before and after training on 14,000 real-world user conversations.
  • Figure 4: SDPO adapts online to changing user preferences. The user’s preference about how the model should respond is flipped to its opposite after the first 250 interactions. SDPO with Qwen3-4B is able to quickly reverse the learned behavior.
  • Figure 5: SDPO rapidly personalizes to individual users from interaction alone. Win rate of SDPO against its base model (Qwen3-4B) for a user that prefers concise, casual, and beginner-friendly model responses.
  • ...and 4 more figures
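
The token-level advantages described in Figures 1 and 2 can be illustrated with a small sketch. It assumes the advantage of each response token is the log-ratio between the hindsight policy (conditioned on the follow-up $o$) and the original policy; the exact definition is not reproduced on this page, so treat the sign convention, the function names, and the toy logits below as illustrative assumptions rather than the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def token_advantages(hindsight_logits, original_logits, response_ids):
    """Assumed form: log pi(y_t | x, o, y_<t) - log pi(y_t | x, y_<t).

    Each logits argument holds one list of vocabulary logits per response
    position; response_ids holds the token actually sampled at that position.
    """
    advantages = []
    for h, p, tok in zip(hindsight_logits, original_logits, response_ids):
        advantages.append(log_softmax(h)[tok] - log_softmax(p)[tok])
    return advantages

# Hypothetical three-token vocabulary and made-up logits.
vocab = ["Well", "Yes", "No"]
response_ids = [0, 1]  # the original response was "Well", "Yes"

# Without the follow-up, the model favors the filler token first.
original_logits = [[2.0, 1.0, 0.0], [0.0, 2.0, 0.0]]
# After seeing o = "I said YES or NO only", it favors a direct answer.
hindsight_logits = [[0.0, 2.0, 1.0], [0.0, 3.0, 0.0]]

adv = token_advantages(hindsight_logits, original_logits, response_ids)
for tok, a in zip(response_ids, adv):
    print(f"{vocab[tok]:>4}: {a:+.3f}")
```

In this toy case the filler token "Well" receives a negative advantage and the answer token "Yes" a positive one, mirroring the penalize/reinforce pattern the captions describe. In practice, both sets of logits would come from the same model evaluated on two prompts, one with and one without the user's follow-up.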

Theorems & Definitions (3)

  • Proposition 1 (informal; see \ref{section:latent_reward_SDPO})
  • Lemma 1
  • Proof of \ref{lemma:gradient_equivalence}