Neural Contextual Bandits Under Delayed Feedback Constraints

Mohammadali Moghimi, Sharu Theresa Jose, Shana Moothedath

TL;DR

The paper tackles neural contextual bandits under delayed reward feedback, where rewards are revealed after random delays. It introduces Delayed NeuralUCB and Delayed NeuralTS to robustly explore and learn in this setting, leveraging neural tangent kernel theory to derive a high-probability regret bound that scales with the effective NTK dimension $\tilde{d}$ and a delay-dependent term $D_+$. The bound shows that delays increase regret through $D_+$ but that the dependence on horizon $T$ remains sublinear, with optimization errors controlled by a tunable parameter $J$. Empirical results on MNIST and Mushroom demonstrate that the proposed methods effectively handle various delay distributions, with neural approaches outperforming linear baselines in complex, high-dimensional settings.

Abstract

This paper presents a new algorithm for neural contextual bandits (CBs) that addresses the challenge of delayed reward feedback, where the reward for a chosen action is revealed after a random, unknown delay. This scenario is common in applications such as online recommendation systems and clinical trials, where reward feedback is delayed because the outcomes or results of a user's actions (such as recommendations or treatment responses) take time to manifest and be measured. The proposed algorithm, called Delayed NeuralUCB, uses an upper confidence bound (UCB)-based exploration strategy. Under the assumption of independent and identically distributed sub-exponential reward delays, we derive an upper bound on the cumulative regret over a T-length horizon. We further consider a variant of the algorithm, called Delayed NeuralTS, that uses Thompson Sampling-based exploration. Numerical experiments on real-world datasets, such as MNIST and Mushroom, along with comparisons to benchmark approaches, demonstrate that the proposed algorithms effectively manage varying delays and are well-suited for complex real-world scenarios.
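The delayed-feedback mechanism described in the abstract can be illustrated with a small bookkeeping sketch: a reward generated at round t only becomes visible to the learner at round t + delay, so each round the learner updates on whatever has arrived so far. The class name `DelayedFeedbackBuffer` and its methods are hypothetical and are not taken from the paper; the sketch shows only the delay bookkeeping, not the neural network or the UCB/Thompson Sampling exploration step.

```python
import heapq


class DelayedFeedbackBuffer:
    """Holds (context, action, reward) triples until their delay has elapsed.

    Hypothetical helper for illustration; not the paper's implementation.
    """

    def __init__(self):
        self._heap = []      # min-heap ordered by arrival round
        self._counter = 0    # tie-breaker so payloads are never compared

    def push(self, t, delay, context, action, reward):
        # A reward generated at round t is revealed at round t + delay.
        arrival = t + delay
        heapq.heappush(self._heap, (arrival, self._counter, (context, action, reward)))
        self._counter += 1

    def pop_arrived(self, t):
        # Return every observation whose arrival round is <= t, earliest first.
        arrived = []
        while self._heap and self._heap[0][0] <= t:
            arrived.append(heapq.heappop(self._heap)[2])
        return arrived

    def __len__(self):
        # Number of rewards still waiting to be revealed.
        return len(self._heap)


# Usage: the action at round 1 has delay 3, the action at round 2 has delay 0.
buf = DelayedFeedbackBuffer()
buf.push(t=1, delay=3, context="x1", action=0, reward=1.0)
buf.push(t=2, delay=0, context="x2", action=1, reward=0.0)
visible_at_round_2 = buf.pop_arrived(2)   # only the round-2 reward has arrived
visible_at_round_4 = buf.pop_arrived(4)   # the round-1 reward arrives at round 4
```

In a bandit loop, the learner would call `pop_arrived(t)` at the start of each round and update its model only on the returned triples, which is why rewards still in the buffer cannot yet influence exploration.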

Paper Structure

This paper contains 14 sections, 5 theorems, 34 equations, 2 figures, 2 algorithms.

Key Result

Theorem 1

Let $\tilde{d}$ be the effective dimension, and $\mathbf{h}=[h(\mathbf{x}_i)]_{i=1}^{TK}$. Let $\mathbb{E}[\tau]$ denote the mean of the delay distribution that satisfies Assumption assum:delays. Under Assumption assum:1 and Assumption assum:3, there exist constants $C_1,C_4>0$ such that for any $\lambda \geq \max\{1,S^{-2}, \mathcal{L}^2\}$, $\mathcal{L} \geq \Vert g(\mathbf{x};\boldsymbol{\t

Figures (2)

  • Figure 1: Comparison of the cumulative regret of the algorithms with no delayed feedback -- LinTS/UCB, Neural-TS/UCB -- with our proposed algorithms Delayed Neural-UCB/TS under uniform delay with $\mathbb{E}[\tau]=30$, as a function of the number of iterations on (left) MNIST and (right) Mushroom datasets. We run 5 experiments and plot the mean regret.
  • Figure 2: (Left) Comparison of the cumulative regret of the Delayed NeuralTS under uniform, exponential and Pareto delays with $\mathbb{E}[\tau]=30$ as a function of the number of iterations on MNIST. (Right) Comparison of the cumulative regret of the Delayed NeuralUCB under the three delays on Mushroom.
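The three delay families named in the captions can each be parameterized to share the mean $\mathbb{E}[\tau]=30$. The sketch below shows one way to do so with the Python standard library. The specific choices (uniform on [0, 60], Pareto shape 3, real-valued rather than integer delays) and the function name `sample_delay` are assumptions made for illustration; the captions state only the distribution family and the mean.

```python
import random

MEAN_DELAY = 30.0  # E[tau] = 30, as stated in the figure captions


def sample_delay(kind, rng, mean=MEAN_DELAY, pareto_shape=3.0):
    """Draw one delay with the requested mean (hypothetical parameterization)."""
    if kind == "uniform":
        # Uniform on [0, 2 * mean] has mean `mean`.
        return rng.uniform(0.0, 2.0 * mean)
    if kind == "exponential":
        # expovariate takes the rate, i.e. 1 / mean.
        return rng.expovariate(1.0 / mean)
    if kind == "pareto":
        # paretovariate(a) has minimum 1 and mean a / (a - 1) for a > 1,
        # so rescale by mean * (a - 1) / a to hit the target mean.
        scale = mean * (pareto_shape - 1.0) / pareto_shape
        return scale * rng.paretovariate(pareto_shape)
    raise ValueError(f"unknown delay distribution: {kind}")


# Usage: empirical means of each family under a fixed seed.
rng = random.Random(0)
empirical_means = {}
for kind in ("uniform", "exponential", "pareto"):
    draws = [sample_delay(kind, rng) for _ in range(200_000)]
    empirical_means[kind] = sum(draws) / len(draws)
```

Matching the means isolates the effect of the tail shape: all three families delay feedback by 30 rounds on average, but the Pareto draws occasionally produce much longer delays than the uniform ones.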

Theorems & Definitions (9)

  • Remark 1
  • Definition 1: jacot2018neural
  • Theorem 1
  • Lemma 1: zhou2020neural
  • Lemma 2
  • proof
  • Lemma 3: zhou2020neural
  • Lemma 4
  • proof