Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits

Qiwei Di; Tao Jin; Yue Wu; Heyang Zhao; Farzad Farnoud; Quanquan Gu

TL;DR

This paper proposes a new SupLinUCB-type algorithm that is computationally efficient and enjoys a variance-aware regret bound, and confirms its advantage over previous variance-agnostic algorithms through experiments on synthetic data.

Abstract

Dueling bandits is a prominent framework for decision-making involving preferential feedback, a valuable feature that fits various applications involving human interaction, such as ranking, information retrieval, and recommendation systems. While substantial efforts have been made to minimize the cumulative regret in dueling bandits, a notable gap in the current research is the absence of regret bounds that account for the inherent uncertainty in pairwise comparisons between the dueling arms. Intuitively, greater uncertainty suggests a higher level of difficulty in the problem. To bridge this gap, this paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM). We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde O\big(d\sqrt{\sum_{t=1}^T \sigma_t^2} + d\big)$, where $\sigma_t^2$ is the variance of the pairwise comparison in round $t$, $d$ is the dimension of the context vectors, and $T$ is the time horizon. Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers an $\tilde O(d)$ regret. We perform empirical experiments on synthetic data to confirm the advantage of our method over previous variance-agnostic algorithms.
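To make the variance-dependent term in the bound concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a logistic link for the GLM (the abstract only says "GLM"), draws random context pairs, and compares the quantity $\sqrt{\sum_t \sigma_t^2}$ for noisy and nearly deterministic comparisons against the worst case $\sqrt{T/4}$, which follows from the fact that a Bernoulli variance is at most $1/4$. All function and variable names (`comparison_probability`, `variance_term`, `scale`) are hypothetical and chosen for illustration only.

```python
import numpy as np

def comparison_probability(theta, x, y):
    """P(x is preferred over y) under an assumed logistic link."""
    z = np.dot(x - y, theta)
    # Numerically stable form of the logistic function 1 / (1 + exp(-z)).
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def comparison_variance(p):
    """Variance of a Bernoulli(p) comparison outcome."""
    return p * (1.0 - p)

def variance_term(theta, scale, horizon, rng):
    """sqrt(sum_t sigma_t^2) over `horizon` random duels.

    `scale` stretches the feature gap: a larger gap pushes the win
    probability toward 0 or 1, so each comparison is less noisy.
    """
    total = 0.0
    for _ in range(horizon):
        x, y = rng.normal(size=(2, theta.shape[0]))
        p = comparison_probability(theta, scale * x, scale * y)
        total += comparison_variance(p)
    return np.sqrt(total)

rng = np.random.default_rng(0)
d, T = 5, 1000
theta = rng.normal(size=d)
theta /= np.linalg.norm(theta)

noisy = variance_term(theta, scale=0.1, horizon=T, rng=rng)
near_deterministic = variance_term(theta, scale=20.0, horizon=T, rng=rng)
worst_case = np.sqrt(T / 4.0)  # every comparison is a fair coin flip

print(f"noisy duels:              {noisy:.2f}")
print(f"nearly deterministic:     {near_deterministic:.2f}")
print(f"variance-agnostic ceiling: {worst_case:.2f}")
```

In this sketch the noisy setting sits close to the $\sqrt{T/4}$ ceiling, while the nearly deterministic setting yields a much smaller value, which is the regime where a variance-aware bound improves on a variance-agnostic one.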

Paper Structure (32 sections, 12 theorems, 89 equations, 2 figures, 1 algorithm)

Introduction
Notation
Related Work
Problem Setup
Algorithm
Overview of the Algorithm
Regularized MLE
Multi-layer Structure with Variance-Aware Confidence Radius
Symmetric Arm Selection
Main Results
Variance-aware Regret Bound
Proof Sketch of Theorem 5.1
Experiments
Experiment Setup
Conclusion
...and 17 more sections

Key Result

Theorem 5.1

If we set $\alpha = 1/T^{3/2}$, then with probability at least $1-2\delta$, the regret of Algorithm 1 is bounded as $\tilde O\big(d\sqrt{\sum_{t=1}^T \sigma_t^2} + d\big)$, the rate stated in the abstract.

Figures (2)

Figure 1: Experiments showing regret performance in various settings.
Figure 2: Regret comparison between VACDB and MaxInP on a real-world dataset.

Theorems & Definitions (16)

Theorem 5.1
Remark 5.2
Remark 5.3
Remark 5.4
Remark 5.5
Lemma 5.6
Theorem D.1 (Brouwer, 1911)
Lemma E.1
Lemma E.2
Lemma E.3
...and 6 more
