Catoni Contextual Bandits are Robust to Heavy-tailed Rewards

Chenlu Ye, Yujia Jin, Alekh Agarwal, Tong Zhang

TL;DR

This paper develops an algorithmic approach building on Catoni's estimator from robust statistics and applies it to contextual bandits with general function approximation. The resulting regret bound depends on the cumulative reward variance and only logarithmically on the reward range and the number of rounds.

Abstract

Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
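
To make the robust-statistics ingredient concrete, here is a minimal sketch of the classical Catoni mean estimator for scalar samples. It is not the paper's bandit algorithm, which applies the idea inside variance-weighted regression over a function class. The function names `catoni_psi` and `catoni_mean`, the simplified scale choice `alpha = sqrt(2 log(1/delta) / (n * variance_bound))`, and the fixed bisection budget are assumptions made for illustration.

```python
import math


def catoni_psi(x):
    """Catoni's influence function: sign(x) * log(1 + |x| + x^2 / 2).

    It behaves like x near zero but grows only logarithmically,
    which limits the pull of extreme samples.
    """
    return math.copysign(math.log1p(abs(x) + 0.5 * x * x), x)


def catoni_mean(samples, variance_bound, delta=0.05, iters=200):
    """Return theta solving sum_i psi(alpha * (x_i - theta)) = 0 by bisection.

    variance_bound is an assumed upper bound on the sample variance;
    delta is the target failure probability (illustrative default).
    """
    n = len(samples)
    # Simplified scale parameter (an assumption, not the paper's tuning).
    alpha = math.sqrt(2.0 * math.log(1.0 / delta) / (n * variance_bound))

    def score(theta):
        return sum(catoni_psi(alpha * (x - theta)) for x in samples)

    # score is non-increasing in theta, non-negative at min(samples)
    # and non-positive at max(samples), so a root lies in [lo, hi].
    lo, hi = min(samples), max(samples)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if score(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # One large outlier among otherwise identical rewards.
    rewards = [1.0] * 99 + [1000.0]
    print("empirical mean:", sum(rewards) / len(rewards))
    print("Catoni estimate:", catoni_mean(rewards, variance_bound=1.0))
```

On data with a single extreme value, the Catoni estimate stays close to the bulk of the samples, while the empirical mean is dragged toward the outlier. This is the kind of behavior that allows a regret bound to scale with the variance rather than the range.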

Paper Structure

This paper contains 47 sections, 30 theorems, 196 equations, 3 tables, 3 algorithms.

Key Result

Theorem 1

For any integer $T>0$, there exists a contextual bandit problem such that any $\pi=\{\pi_t\}_{t=1}^T$ will incur regret at least $\Omega(\sqrt{\mathbb{E}\sum_{t=1}^T\sigma_t^2})$, where $\{\sigma_t^2=\mathrm{Var}_{x_t\sim\pi_t}[y_t]\}_{t=1}^T$ and the expectation is jointly over any randomness in the

Theorems & Definitions (53)

  • Definition 1: $\upsilon$-cover and covering number
  • Definition 2: Eluder dimension [gentile2022achieving]
  • Theorem 1
  • Lemma 1: Informal
  • Remark 1
  • Theorem 2: Informal
  • Lemma 2
  • Lemma 3
  • Theorem 3: Informal
  • Lemma 4
  • ...and 43 more