Preference Learning with Response Time: Robust Losses and Guarantees

Ayush Sawarni, Sahasrajit Sarmasarkar, Vasilis Syrgkanis

TL;DR

The paper addresses learning reward models from human preferences by enriching binary choices with response-time data under the EZ-diffusion model. It introduces a Neyman-orthogonal loss that debiases nuisance components and achieves oracle-like convergence, extending from linear to nonparametric reward spaces. Theoretical results show exponential-to-polynomial improvements in estimation error for linear rewards and finite-sample guarantees for general function classes, complemented by comprehensive experiments on linear, nonlinear, and text-to-image preference tasks. These methods promise substantial gains in data efficiency for large-scale human-in-the-loop systems and offer avenues for extending to bandits and DPO-style policy learning.

Abstract

This paper investigates the integration of response time data into human preference learning frameworks for more effective reward model elicitation. While binary preference data has become fundamental in fine-tuning foundation models, generative AI systems, and other large-scale models, the valuable temporal information inherent in user decision-making remains largely unexploited. We propose novel methodologies to incorporate response time information alongside binary choice data, leveraging the Evidence Accumulation Drift Diffusion (EZ) model, under which response time is informative of the preference strength. We develop Neyman-orthogonal loss functions that achieve oracle convergence rates for reward model learning, matching the theoretical optimal rates that would be attained if the expected response times for each query were known a priori. Our theoretical analysis demonstrates that for linear reward functions, conventional preference learning suffers from error rates that scale exponentially with reward magnitude. In contrast, our response time-augmented approach reduces this to polynomial scaling, representing a significant improvement in sample efficiency. We extend these guarantees to non-parametric reward function spaces, establishing convergence properties for more complex, realistic reward models. Our extensive experiments validate our theoretical findings in the context of preference learning over images.
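To make concrete why response time can carry information that a binary choice alone loses, the sketch below evaluates textbook closed forms for a drift-diffusion process with symmetric barriers. Everything in it is an assumption rather than the paper's stated parametrization: the drift `r` stands in for the reward difference, the barriers sit at `+a` and `-a`, diffusion is unit-variance, the process starts at zero, and non-decision time is ignored. The function names are hypothetical.

```python
import math


def choice_prob(r: float, a: float) -> float:
    """P(upper barrier is hit first) for drift r and barriers at +/-a (assumed setup)."""
    return 1.0 / (1.0 + math.exp(-2.0 * a * r))


def expected_choice(r: float, a: float) -> float:
    """E[Y] for Y in {-1, +1}; equals 2 * choice_prob - 1 = tanh(a * r)."""
    return math.tanh(a * r)


def expected_time(r: float, a: float) -> float:
    """E[T] = (a / r) * tanh(a * r), with the limiting value a**2 as r -> 0."""
    if abs(r) < 1e-12:
        return a * a
    return (a / r) * math.tanh(a * r)


if __name__ == "__main__":
    a = 1.0  # hypothetical barrier height
    for r in (0.5, 2.0, 4.0):
        p = choice_prob(r, a)
        ratio = expected_choice(r, a) / expected_time(r, a)
        # The choice probability saturates toward 1 as r grows,
        # while E[Y] / E[T] stays equal to r / a.
        print(f"r={r:.1f}  P(choice)={p:.6f}  E[Y]/E[T]={ratio:.6f}")
```

Under these assumptions the choice probability flattens out for large `r`, so choices alone distinguish strong preferences poorly, whereas the ratio of expected choice to expected response time remains linear in `r`. This is only an illustration of the intuition in the abstract, not the estimator or loss the paper proposes.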

Paper Structure

This paper contains 53 sections, 21 theorems, 181 equations, 3 figures, 3 tables.

Key Result

Lemma 3.1

The population loss ${\mathcal{L}}^{\mathtt{ortho}}$ is Neyman-orthogonal with respect to the nuisance $g \coloneqq \left( \mathfrak{r} , t \right)$, i.e., $D_g D_{r} {\mathcal{L}}^{\mathtt{ortho}} (r_o; g_o)[r - r_o, g- g_o] = 0 \quad \forall r \in {\mathcal{R}} \quad \forall g \in \mathcal{G}$.
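The cross-derivative notation in the lemma is not defined in this summary. Assuming it follows the usual convention for directional (Gateaux) derivatives, with $s$ and $u$ as scalar perturbation sizes introduced here only for illustration, it can be read as:

$$D_g D_{r} {\mathcal{L}}^{\mathtt{ortho}} (r_o; g_o)[r - r_o, g - g_o] = \left. \frac{\partial^2}{\partial u \, \partial s} {\mathcal{L}}^{\mathtt{ortho}} \big( r_o + s (r - r_o); \; g_o + u (g - g_o) \big) \right|_{s = 0, \, u = 0}$$

Under that reading, the lemma says that small errors in the nuisance $g$ have no first-order effect on the gradient of the loss with respect to the reward $r$ at the true pair $(r_o, g_o)$.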

Figures (3)

  • Figure 1: Performance of the linear‐reward model as the true parameter magnitude $\lVert\theta_{0}\rVert$ varies. Left: $d=5$; right: $d=10$.
  • Figure 2: Left: mean‑squared error ($\pm$ standard error); right: cumulative regret ($\pm$ standard error) over $M =3000$ new queries on randomly sampled non-linear (neural network) reward models, both plotted against training‑set size $N$.
  • Figure 3: Left: mean‑squared error ($\pm$ standard deviation); right: cumulative regret ($\pm$ standard deviation) over $M =10000$ new queries on the Pick‑a‑Pic text‑to‑image task, both plotted against training‑set size $N$.

Theorems & Definitions (61)

  • Lemma 3.1
  • proof
  • Theorem 4.1
  • Theorem 5.1
  • Remark 5.2
  • Corollary 5.2: Data‐splitting
  • Corollary 5.2: Data‐reuse
  • Lemma A.1
  • proof
  • Corollary A.2
  • ...and 51 more