Neural Contextual Bandits with Deep Representation and Shallow Exploration

Pan Xu, Zheng Wen, Handong Zhao, Quanquan Gu

TL;DR

The paper introduces Neural-LinUCB, a neural contextual bandit method that decouples representation learning from exploration by transforming contexts with the deep network’s last hidden layer and performing UCB exploration only in the final linear layer. It provides a sublinear regret guarantee of order $\tilde{O}(\sqrt{T})$ under NTK-based assumptions and shows that the approach remains computationally efficient by avoiding exploration across all network parameters. Theoretical guarantees hinge on a combination of linear-bandit-style analysis and neural-network NTK properties, with an error term that depends on how well the neural network approximates the true reward function. Empirically, Neural-LinUCB outperforms NeuralUCB and Neural-Linear on real datasets while achieving substantially faster runtimes, highlighting the practical value of deep representation with shallow exploration.

Abstract

We study a general class of contextual bandits, where each context-action pair is associated with a raw feature vector, but the reward generating function is unknown. We propose a novel learning algorithm that transforms the raw feature vector using the last hidden layer of a deep ReLU neural network (deep representation learning), and uses an upper confidence bound (UCB) approach to explore in the last linear layer (shallow exploration). We prove that under standard assumptions, our proposed algorithm achieves $\tilde{O}(\sqrt{T})$ finite-time regret, where $T$ is the learning time horizon. Compared with existing neural contextual bandit algorithms, our approach is computationally much more efficient since it only needs to explore in the last layer of the deep neural network.
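The split between deep representation and shallow exploration can be illustrated with a minimal sketch. It is not the authors' implementation: the network is a small, randomly initialized ReLU map that stays frozen (the periodic network update described in the paper is omitted), the exploration weight `alpha` is a constant rather than the schedule from the theorem, and the rewards are synthetic. The names `relu_features`, `ShallowUCB`, and `run_demo` are hypothetical.

```python
import numpy as np


def relu_features(x, weights):
    """Output of the last hidden layer of a small ReLU network (stand-in representation)."""
    h = x
    for w in weights:
        h = np.maximum(h @ w, 0.0)
    return h


class ShallowUCB:
    """UCB exploration restricted to the final linear layer."""

    def __init__(self, feat_dim, lam=1.0, alpha=1.0):
        self.A = lam * np.eye(feat_dim)  # regularized Gram matrix of chosen features
        self.b = np.zeros(feat_dim)      # reward-weighted sum of chosen features
        self.alpha = alpha               # constant exploration weight (an assumption)

    def select(self, feats):
        """feats has shape (K, feat_dim), one row per action; returns the chosen index."""
        A_inv = np.linalg.inv(self.A)
        theta = A_inv @ self.b  # ridge-regression estimate of the last-layer weights
        # Confidence width sqrt(phi^T A^{-1} phi) for each action.
        bonus = np.sqrt(np.einsum("kd,de,ke->k", feats, A_inv, feats))
        return int(np.argmax(feats @ theta + self.alpha * bonus))

    def update(self, feat, reward):
        self.A += np.outer(feat, feat)
        self.b += reward * feat


def run_demo(T=200, K=4, d=5, m=16, seed=0):
    """Run the sketch on synthetic data; returns cumulative regret and the learner."""
    rng = np.random.default_rng(seed)
    weights = [
        rng.normal(size=(d, m)) / np.sqrt(d),
        rng.normal(size=(m, m)) / np.sqrt(m),
    ]
    # Synthetic ground truth that is linear in the frozen features (an assumption).
    theta_star = rng.normal(size=m) / np.sqrt(m)
    bandit = ShallowUCB(feat_dim=m)
    regret = 0.0
    for _ in range(T):
        contexts = rng.normal(size=(K, d))       # raw feature vector per action
        feats = relu_features(contexts, weights)  # deep representation
        a = bandit.select(feats)                  # shallow exploration
        means = feats @ theta_star
        reward = means[a] + 0.1 * rng.normal()
        bandit.update(feats[a], reward)
        regret += means.max() - means[a]
    return regret, bandit
```

Only the `feat_dim`-sized matrix `A` is maintained and inverted, regardless of how many parameters the network has. This is the source of the computational saving over exploring in all network parameters.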

Paper Structure

This paper contains 15 sections, 12 theorems, 79 equations, 1 figure, 1 table, 2 algorithms.

Key Result

Theorem 4.4

Suppose Assumptions asp:nondegen, asp:gradient_fos and asp:ntk_pd hold. Assume that $\|\bm{\theta}^*\|_2\leq M$ for some positive constant $M>0$. For any $\delta\in(0,1)$, let us choose $\alpha_t$ in $\text{Neural-LinUCB}$ as [equation missing in source]. We choose the step size $\eta_{q}$ of Algorithm alg:update_nn as [equation missing in source], and the width of the neural network satisfies $m=\text{poly}(L,d,1/\delta,H,\log(TK/\delta))$. With probabi

Figures (1)

  • Figure 1: The cumulative regrets of LinUCB, NeuralUCB, Neural-Linear and $\text{Neural-LinUCB}$ over $15{,}000$ rounds. Results are averaged over 10 repetitions.

Theorems & Definitions (14)

  • Theorem 4.4
  • Remark 4.5
  • Remark 4.6
  • Lemma A.1
  • Lemma A.2
  • Lemma A.3: Theorem 5 in cao2019generalization2
  • Lemma A.4
  • Lemma A.5
  • Lemma A.6
  • Lemma B.1: Theorem 3.1 in arora2019exact
  • ...and 4 more