Contextual Linear Bandits with Delay as Payoff

Mengxiao Zhang, Yingfei Wang, Haipeng Luo

TL;DR

This work extends the delay-as-payoff model to contextual linear bandits, addressing payoff-dependent observation delays in a practical, high-dimensional action space. It introduces a phased arm-elimination algorithm that relies on a volumetric spanner to construct confidence bounds from a small action subset, enabling robust learning without directly estimating the unknown parameter $\theta$. The authors establish problem-dependent regret bounds that separate delay overhead from the standard regret, and extend the results to time-varying action sets via a contextual reduction, including a misspecification-tolerant variant. Empirical results on synthetic data show the proposed approach outperforms LinUCB under delayed feedback and exhibits a phase-transition behavior as bad actions are pruned, highlighting its practical impact for delay-sensitive contextual decision problems.

Abstract

A recent work by Schlisselberg et al. (2024) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff itself. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is at most $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When the payoff is a loss, we also show a further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2024). In contrast to standard linear bandit algorithms that construct a least squares estimator and a confidence ellipsoid, the main novelty of our algorithm is a phased arm elimination procedure that only picks actions in a volumetric spanner of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.
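To make the feedback model concrete, the following is a minimal sketch of delay-as-payoff observation timing. It is an illustration rather than the authors' implementation: the function name `simulate_delay_as_payoff`, the assumption that payoffs lie in $[0,1]$, and the specific rule "delay = ceil(D × payoff)" are all assumptions chosen to match the abstract's description of a delay proportional to the payoff with maximum delay $D$.

```python
import heapq
import math
from typing import List, Tuple


def simulate_delay_as_payoff(
    payoffs: List[float], max_delay: int
) -> List[List[Tuple[int, float]]]:
    """For each round t, list the (played_round, payoff) pairs revealed at the end of t.

    Assumption (hypothetical rule): a payoff u in [0, 1] generated at round t
    becomes visible at the end of round t + ceil(max_delay * u).
    """
    pending: List[Tuple[int, int, float]] = []  # min-heap keyed by arrival round
    observed: List[List[Tuple[int, float]]] = []
    for t, u in enumerate(payoffs):
        delay = math.ceil(max_delay * u)
        heapq.heappush(pending, (t + delay, t, u))
        arrived = []
        # Release every payoff whose arrival round has been reached.
        while pending and pending[0][0] <= t:
            _, played_round, payoff = heapq.heappop(pending)
            arrived.append((played_round, payoff))
        observed.append(arrived)
    return observed


if __name__ == "__main__":
    for t, arrived in enumerate(simulate_delay_as_payoff([0.5, 0.0, 1.0, 0.25], 2)):
        print(t, arrived)
```

In this sketch, small payoffs are revealed sooner than large ones, so the set of observations available at any round is biased toward small values. Payoffs still pending when the horizon ends are never observed.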

Paper Structure

This paper contains 22 sections, 15 theorems, 97 equations, 1 figure, 4 algorithms.

Key Result

Proposition 3.2

Given a finite set ${\mathcal{A}}$ of size $K$, there exists an efficient algorithm that finds a volumetric spanner ${\mathcal{S}}$ of ${\mathcal{A}}$ with $|{\mathcal{S}}|=3n$ in $\mathcal{O}(Kn^3\log n)$ time.
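As an illustration of what the proposition guarantees, the sketch below checks the spanner property for a candidate subset. It does not implement the construction algorithm from the proposition. It assumes the usual definition from Hazan and Karnin (2016), which this page names but does not state: ${\mathcal{S}} \subseteq {\mathcal{A}}$ is a volumetric spanner if every $a \in {\mathcal{A}}$ can be written as $a = \sum_i \lambda_i s_i$ with $\|\lambda\|_2 \le 1$. When the vectors in ${\mathcal{S}}$ span $\mathbb{R}^n$, the smallest achievable $\|\lambda\|_2^2$ equals $a^\top (\sum_i s_i s_i^\top)^{-1} a$, which is what the code evaluates. All function names are hypothetical, and the check that ${\mathcal{S}}$ is a subset of ${\mathcal{A}}$ is left to the caller.

```python
from typing import List, Sequence

Vector = Sequence[float]


def solve(matrix: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("candidate vectors do not span the space")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - tail) / aug[row][row]
    return x


def min_coeff_norm_sq(spanner: Sequence[Vector], action: Vector) -> float:
    """Smallest squared 2-norm of coefficients expressing `action` in `spanner`.

    Assumes the spanner vectors span R^n, so the Gram-type matrix is invertible.
    """
    n = len(action)
    gram = [[sum(s[i] * s[j] for s in spanner) for j in range(n)] for i in range(n)]
    z = solve(gram, list(action))
    return sum(a * zi for a, zi in zip(action, z))


def is_volumetric_spanner(
    spanner: Sequence[Vector], actions: Sequence[Vector], tol: float = 1e-9
) -> bool:
    """Check that every action has a representation with coefficient norm at most 1."""
    return all(min_coeff_norm_sq(spanner, a) <= 1.0 + tol for a in actions)


if __name__ == "__main__":
    basis = [(1.0, 0.0), (0.0, 1.0)]
    print(is_volumetric_spanner(basis, basis + [(0.6, 0.6)]))
    print(is_volumetric_spanner(basis, basis + [(0.9, 0.9)]))
```

In the usage example, the two standard basis vectors cover $(0.6, 0.6)$ because the required coefficient norm is below 1, but they do not cover $(0.9, 0.9)$, so a spanner for that larger set would need different or additional elements.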

Figures (1)

  • Figure 1: Comparison of the empirical results of our algorithm and LinUCB. The top row is the delay-as-loss setting and the bottom row is the delay-as-reward setting. The left, middle, and right columns correspond to $n=6,8,10$, respectively.

Theorems & Definitions (28)

  • Definition 3.1: Volumetric Spanner (Hazan and Karnin, 2016)
  • Proposition 3.2 (Bhaskara et al., 2023)
  • Theorem 3.3
  • Theorem 4.1
  • Theorem 1.1
  • Lemma 1.2
  • proof
  • Lemma 1.3
  • proof
  • Lemma 1.4
  • ...and 18 more