Humanline: Online Alignment as Perceptual Loss

Sijia Liu, Niklas Muennighoff, Kawin Ethayarajh

Abstract

Online alignment (e.g., GRPO) is generally more performant than offline alignment (e.g., DPO) -- but why? Drawing on prospect theory from behavioral economics, we propose a human-centric explanation. We prove that online on-policy sampling better approximates the human-perceived distribution of what the model can produce, and PPO/GRPO-style clipping -- originally introduced to just stabilize training -- recovers a perceptual bias in how humans perceive probability. In this sense, PPO/GRPO act as perceptual losses already. Our theory further suggests that the online/offline dichotomy is itself incidental to maximizing human utility, since we can achieve the same effect by selectively training on any data in a manner that mimics human perception, rather than restricting ourselves to online on-policy data. Doing so would allow us to post-train more quickly, cheaply, and flexibly without sacrificing performance. To this end, we propose a design pattern that explicitly incorporates perceptual distortions of probability into objectives like DPO/KTO/GRPO, creating humanline variants of them. Surprisingly, we find that these humanline variants, even when trained with offline off-policy data, can match the performance of their online counterparts (on both verifiable and unverifiable tasks) while running up to 6x faster.

Paper Structure

This paper contains 32 sections, 3 theorems, 20 equations, 11 figures, 7 tables, 2 algorithms.

Key Result

Proposition 3.4

For any input $x$ and bounded value function $v$, let the outcome of an output $y$ be its surprisal $\log [\pi_\theta(y|x)/\pi_\text{ref}(y|x)]$ and $Q$ be a candidate distribution over outcomes. Then to guarantee $| u(Z;\omega) - u(Z;Q) | \leq \delta$ for some $\delta \geq 0$, it suffices that $\sq

Figures (11)

  • Figure 1: On instruction-following, Llama3-8B-Instruct aligned with online on-policy data (blue) is 1.3x to 1.6x better than one aligned with offline off-policy data (red). However, when the same offline data is fed to the humanline variant of the objective (orange), the gap vanishes.
  • Figure 2: To estimate human utility, outputs should be sampled from the typical human-perceived distribution of what the policy can produce, whose inverted S-shape comes from prospect theory. Online on-policy sampling (dashed black) is superior to offline off-policy---both from worse (red) and better (blue) models---because the latter deviate more from human perception (solid black). Rejection-sampling with perceptual bias gives us humanline sampling (green) that can mimic this, and a special case of it simplifies to the humanline clipping used in our design pattern.
  • Figure 3: In offline objectives (left), the reference model does not change during training. In online objectives (middle), the reference is synced with the policy at the current step; at scale, some asynchrony is permitted (a lag of one step is depicted here). In humanline syncing (right), every $k$ steps, the reference is synced with the policy from the previous step ($k = 1$ is depicted here).
  • Figure 4: In humanline clipping, the token-wise likelihood ratios $r_\theta(i,t)$ are asymmetrically clipped to $[\epsilon_P, \epsilon_R]$ upstream of the loss. In the humanline variant of GRPO, instead of an unclipped $r_\theta$ and a $[1 - \epsilon, 1 + \epsilon]$-clipped $r_\theta$ as in the standard GRPO objective, we have a once-clipped and a twice-clipped $r_\theta$. Though humanline clipping should in theory be most impactful for losses without any clipping to begin with (e.g., DPO, KTO), it still benefits GRPO (see the ablations figure, left).
  • Figure 5: The majority of the improvement comes from humanline syncing (left). However, humanline clipping is still necessary: syncing alone is not competitive with online alignment. Although humanline clipping is a special case of the more general humanline sampling (see the section on clipping), it performs as well while being more stable and simpler to implement (right).
  • ...and 6 more figures
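
The captions for Figures 3 and 4 describe two mechanisms, humanline syncing and humanline clipping, that can be illustrated with a toy sketch. The code below is not the authors' implementation: the function names, the scalar stand-in for model parameters, the numeric bounds, and the exact ordering of update and sync within a step are all assumptions made for illustration. It only shows (a) clipping token-wise likelihood ratios to an asymmetric range $[\epsilon_P, \epsilon_R]$ before any loss is computed, (b) a GRPO-style surrogate that then pairs this once-clipped ratio with a twice-clipped copy, and (c) a reference that is set every $k$ steps to the policy from the previous step.

```python
import math


def humanline_clip(logp_policy, logp_ref, eps_p, eps_r):
    """Clip token-wise likelihood ratios to [eps_p, eps_r] upstream of the loss.

    logp_policy / logp_ref are per-token log-probabilities (hypothetical inputs).
    """
    ratios = [math.exp(lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    return [min(max(r, eps_p), eps_r) for r in ratios]


def grpo_style_surrogate(once_clipped, advantage, eps):
    """PPO/GRPO-style surrogate applied to ratios that were already clipped once.

    The second clip to [1 - eps, 1 + eps] yields the "twice-clipped" ratio.
    """
    terms = []
    for r in once_clipped:
        twice = min(max(r, 1.0 - eps), 1.0 + eps)
        terms.append(min(r * advantage, twice * advantage))
    return sum(terms) / len(terms)


def run_with_humanline_sync(num_steps, k, update):
    """Toy loop: every k steps, set the reference to the previous step's policy.

    A single float stands in for model parameters (an assumption for brevity).
    """
    policy = 0.0
    ref = policy
    history = []
    for step in range(1, num_steps + 1):
        prev_policy = policy          # policy as it was before this step's update
        policy = update(policy)       # one optimizer step (hypothetical)
        if step % k == 0:
            ref = prev_policy         # sync to the previous step, not the current one
        history.append((step, policy, ref))
    return history


# Usage with made-up numbers: ratios of 1.0, ~4.0, and ~0.1 clipped to [0.5, 2.0].
clipped = humanline_clip([0.0, math.log(4.0), math.log(0.1)], [0.0, 0.0, 0.0], 0.5, 2.0)
loss_term = grpo_style_surrogate(clipped, advantage=1.0, eps=0.2)
trace = run_with_humanline_sync(num_steps=4, k=2, update=lambda p: p + 1.0)
```

In this sketch the reference always trails the policy by at least one step, which mirrors the lag shown in Figure 3; whether the sync happens before or after the optimizer update in practice is not specified by the captions.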

Theorems & Definitions (10)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Proposition 3.4
  • Proposition 4.1
  • Definition 4.2
  • Theorem 4.3
  • proof
  • proof
  • proof