Enhancing Preference-based Linear Bandits via Human Response Time

Shen Li; Yuyang Zhang; Zhaolin Ren; Claire Liang; Na Li; Julie A. Shah

TL;DR

This work addresses learning human preferences from binary choices by enriching the signal with response times, using a difference-based EZ-diffusion model within a linear utility framework. It introduces a choice-decision-time (CH,DT) estimator that combines choices and decision times to estimate $\theta^*/a$ efficiently, and derives both asymptotic and non-asymptotic guarantees showing that decision times add the most information for queries with large utility differences. The CH,DT estimator is integrated into the Generalized Successive Elimination (GSE) algorithm for fixed-budget best-arm identification, and empirical results on synthetic data and three real-world datasets demonstrate faster and more reliable preference learning than choice-only approaches. The findings suggest that response-time information can substantially accelerate interactive preference learning in practical systems. Stated limitations and future work concern the reliability of response-time data and estimating non-decision times directly from data.

Abstract

Interactive preference learning systems infer human preferences by presenting queries as pairs of options and collecting binary choices. Although binary choices are simple and widely used, they provide limited information about preference strength. To address this, we leverage human response times, which are inversely related to preference strength, as an additional signal. We propose a computationally efficient method that combines choices and response times to estimate human utility functions, grounded in the EZ diffusion model from psychology. Theoretical and empirical analyses show that for queries with strong preferences, response times complement choices by providing extra information about preference strength, leading to significantly improved utility estimation. We incorporate this estimator into preference-based linear bandits for fixed-budget best-arm identification. Simulations on three real-world datasets demonstrate that using response times significantly accelerates preference learning compared to choice-only approaches. Additional materials, such as code, slides, and talk video, are available at https://shenlirobot.github.io/pages/NeurIPS24.html
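The identity behind combining the two signals can be illustrated for a single query. For a drift-diffusion process with drift $u = x^\top\theta^*$, symmetric barriers $\pm a$, and unit diffusion coefficient (an assumption of this sketch), the standard results $\mathbb{E}[c_x]=\tanh(au)$ and $\mathbb{E}[t_x]=(a/u)\tanh(au)$ give $\mathbb{E}[c_x]/\mathbb{E}[t_x]=u/a$, so the ratio of summed choices to summed decision times estimates $u/a$ without iterative fitting. The Python sketch below simulates this with a simple Euler discretization. The function names, the step size `dt`, and the parameter values are illustrative and not taken from the paper, and the sketch covers the single-query idea only, not the paper's full estimator.

```python
import math
import random


def simulate_trial(u, a, rng, dt=1e-3):
    """Simulate one trial: evidence follows dE = u*dt + dW until it hits +a or -a.

    Returns (choice, decision_time), with choice in {+1, -1}.
    Assumes unit diffusion coefficient and no non-decision time.
    """
    evidence, t = 0.0, 0.0
    step_sd = math.sqrt(dt)
    while abs(evidence) < a:
        evidence += u * dt + step_sd * rng.gauss(0.0, 1.0)
        t += dt
    return (1 if evidence >= a else -1), t


def ratio_estimate(u, a, n, seed=0):
    """Estimate u/a for one query as (sum of choices) / (sum of decision times)."""
    rng = random.Random(seed)
    sum_c, sum_t = 0, 0.0
    for _ in range(n):
        c, t = simulate_trial(u, a, rng)
        sum_c += c
        sum_t += t
    return sum_c / sum_t


# Illustrative values (assumptions of this sketch, not from the paper).
u_true, a_true = 1.0, 1.0
u_over_a_hat = ratio_estimate(u_true, a_true, n=2000)
print(f"estimate of u/a: {u_over_a_hat:.3f} (target {u_true / a_true:.3f})")
```

With a fixed seed the estimate should land near the target `u / a`, up to sampling noise and a small discretization bias from the finite step size. In practice the decision time would be the observed response time minus the non-decision time, which this sketch takes as already removed.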

Paper Structure (32 sections, 9 theorems, 55 equations, 9 figures, 1 algorithm)

Introduction
Problem setting and preliminaries
Preference-based bandits with a linear utility function.
Learning objective: Best-arm identification with a fixed budget.
Utility estimation
Choice-decision-time estimator and choice-only estimator
Asymptotic normality of the two estimators
Non-asymptotic concentration of the two estimators for utility difference estimation
Interactive learning algorithm
Empirical results
Estimation performance on synthetic data
Fixed-budget best-arm identification performance on real datasets
Conclusion and future work
Broader impacts
Literature review
...and 17 more sections

Key Result

Theorem 3.1

Given a fixed i.i.d. dataset $\left\{x,c_{x,s_{x,i}},t_{x,s_{x,i}}\right\}_{i\in[n]}$ for each $x\in\mathcal{X}_{\text{sample}}$, where $\sum_{x\in\mathcal{X}_{\text{sample}}}xx^\top\succ 0$, and assuming that the datasets for different $x\in\mathcal{X}_{\text{sample}}$ are independent, then, for any vector $y\in\mathbb{R}^d$, … Here, the asymptotic variance depends on a problem-specific constant, $\zeta^2$, with an upper bound …

Figures (9)

Figure 1: (a) depicts the human decision-making process for a binary query $x \in \mathcal{X}$, where the human selects between two arms. The human first spends a fixed non-decision time $t_{\text{nondec}}$ encoding the query. Then, the human's evidence accumulates according to a Brownian motion with drift $x^\top \theta^*$. When the evidence reaches the upper barrier $a$ or lower barrier $-a$, the human makes a choice, denoted by $c_x = 1$ or $c_x = -1$, respectively. The random stopping time of the accumulation process is the decision time $t_x$, and the total response time is $t_{\text{RT},x}=t_{\text{nondec}}+t_x$. (b) and (c) plot the expected choice $\mathbb{E}[c_x]$ and the expected decision time $\mathbb{E}[t_x]$ as functions of the utility difference $x^\top \theta^*$ for two barrier values $a$, with shaded regions representing one standard deviation.
Figure 2: This figure illustrates key terms from our theoretical analyses, highlighting the different contributions of choices and decision times to utility estimation. These terms are functions of the utility difference $x^\top\theta^*$ and are plotted for two barrier values, $a$. (a) compares the weights $\mathbb{E}[t_x]$ and $a^2\,\mathbb{V}[c_x]$ in the asymptotic variances for the choice-decision-time estimator (orange, Theorem 3.1) and the choice-only estimator (gray, Theorem 3.2), respectively. This comparison shows that decision times complement choices, particularly for queries with strong preferences. (b) compares the weights in the non-asymptotic concentration bounds (Theorems 3.3 and 3.4), showing similar trends, though these weights may not be optimal due to proof techniques.
Figure 3: Three heatmaps show estimation error probabilities, $\mathbb{P}[\mathop{\mathrm{arg\,max}}\limits_{z\in\mathcal{Z}} z^\top \widehat{\theta}\neq z^*]$, for three GSE variations as functions of the arm scaling factor $c_{\mathcal{Z}}$ and barrier $a$. Darker colors indicate better estimation. (a) The choice-only estimator $\widehat{\theta}_{\text{CH}}$ with the transductive design $\lambda_{\text{trans}}$ struggles as $c_{\mathcal{Z}}$ increases (i.e., preferences become stronger), highlighting that choices from queries with strong preferences provide limited information. (b) The weak-preference design $\lambda_{\text{weak}}$ improves on (a) by sampling queries with weak preferences but assumes perfect knowledge of $\theta^*$ and equal resource consumption across queries. (c) The choice-decision-time estimator $\widehat{\theta}_{\text{CH,DT}}$ with $\lambda_{\text{trans}}$ outperforms both choice-only methods in (a) and (b), showing that decision times complement choices and improve estimation, especially for strong preferences.
Figure 4: Violin plots (with overlaid box plots) for datasets (a), (b), and (c) show the distribution of best-arm identification error probabilities, $\mathbb{P}\left[\widehat{z}\neq z^*\right]$, over all bandit instances for six GSE variations and two budgets. The box plots follow the convention of https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html. For each GSE variation and budget, the horizontal line in the middle of the box represents the median of the error probabilities across all bandit instances. Each error probability is averaged over $300$ repeated simulations under different random seeds. The box's upper and lower borders represent the third and first quartiles, respectively, with whiskers extending to the farthest points within $1.5 \times$ the interquartile range. Flier points indicate outliers beyond the whiskers.
Figure 5: A violin plot overlaid with a box plot showing the best-arm identification error probability, $\mathbb{P}\left[\widehat{z}\neq z^*\right]$, as a function of budget for each GSE variation, simulated using the food-risk dataset with choices (-1 or 1) [smith2018attention], as described in the appendix. The box plots follow the convention of https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html. For each GSE variation and budget, the horizontal line in the middle of the box represents the median of the error probabilities across all bandit instances. Each error probability is averaged over $300$ repeated simulations under different random seeds. The box's upper and lower borders represent the third and first quartiles, respectively, with whiskers extending to the farthest points within $1.5 \times$ the interquartile range. Flier points indicate outliers beyond the whiskers.
...and 4 more figures
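The quantities named in the captions of Figures 1 and 2 have closed forms under the model described there, which makes the contrast between the two weights easy to check numerically. The Python sketch below assumes a unit diffusion coefficient and uses the standard results $\mathbb{E}[c_x]=\tanh(au)$, $\mathbb{E}[t_x]=(a/u)\tanh(au)$ (with limit $a^2$ as $u\to 0$), and $\mathbb{V}[c_x]=1-\tanh^2(au)$, where $u=x^\top\theta^*$. The function names and the grid of $u$ values are illustrative choices, not the paper's.

```python
import math


def expected_choice(u, a):
    """E[c_x] for drift u and barriers +/-a (unit diffusion assumed)."""
    return math.tanh(a * u)


def expected_decision_time(u, a):
    """E[t_x]; equals a**2 in the limit u -> 0."""
    if abs(u) < 1e-12:
        return a * a
    return (a / u) * math.tanh(a * u)


def choice_variance(u, a):
    """V[c_x] = 1 - E[c_x]**2, since c_x takes values in {+1, -1}."""
    return 1.0 - math.tanh(a * u) ** 2


a = 1.0  # illustrative barrier value
rows = []
for u in (0.0, 0.5, 1.0, 2.0, 4.0):
    w_dt = expected_decision_time(u, a)   # weight named for the CH,DT estimator
    w_ch = a * a * choice_variance(u, a)  # weight named for the choice-only estimator
    rows.append((u, w_dt, w_ch))
    print(f"u={u:3.1f}  E[t_x]={w_dt:.4f}  a^2 V[c_x]={w_ch:.4f}")
```

The two weights coincide at $u=0$. As $|u|$ grows, $a^2\,\mathbb{V}[c_x]$ decays exponentially while $\mathbb{E}[t_x]$ decays only like $a/|u|$, which matches the captions' point that choices alone carry little information for strong preferences.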

Theorems & Definitions (14)

Theorem 3.1: Asymptotic normality of $\widehat{\theta}_{\text{CH,DT}}$
Theorem 3.2: Asymptotic normality of $\widehat{\theta}_{\text{CH}}$
Theorem 3.3: Non-asymptotic concentration of $\widehat{u}_{x,\text{CH,DT}}$
Theorem 3.4: Non-asymptotic concentration of $\widehat{u}_{x,\text{CH}}$
Theorem C.1: Asymptotic normality of $\widehat{\theta}_{\text{CH,DT}}$
proof
Lemma C.1
proof
Theorem C.1: Non-asymptotic concentration of $\widehat{u}_{x,\text{CH,DT}}$
proof
...and 4 more
