Neural Exploitation and Exploration of Contextual Bandits

Yikun Ban, Yuchen Yan, Arindam Banerjee, Jingrui He

TL;DR

This work tackles the exploration-exploitation dilemma in contextual bandits with non-linear rewards by introducing EE-Net, a dual-network framework consisting of an exploitation network $f_1$ and an exploration network $f_2$. The key idea is to use the gradient of $f_1$ as input to $f_2$, which learns the residual "potential gain" $h(x) - f_1(x)$ to drive adaptive upward or downward exploration. The authors establish an instance-dependent regret bound of $\widetilde{\mathcal{O}}(\sqrt{T})$ under over-parameterization and show that EE-Net outperforms both linear and neural baselines on multiple real-world datasets, with favorable inference-time characteristics. The approach integrates neural representation learning for both reward estimation and exploration, offering a scalable and effective strategy for complex, non-linear bandit problems.

Abstract

In this paper, we study utilizing neural networks for the exploitation and exploration of contextual multi-armed bandits. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration trade-off in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, a series of neural bandit algorithms have been proposed to adapt to the non-linear reward function, combined with TS or UCB strategies for exploration. In this paper, instead of calculating a large-deviation-based statistical bound for exploration like previous methods, we propose EE-Net, a novel neural-based exploitation and exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn the potential gains compared to the currently estimated reward for exploration. We provide an instance-based $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound for EE-Net and show that EE-Net outperforms related linear and neural contextual bandit baselines on real-world datasets.
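The two-network idea described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the class `TinyNet`, the helpers `select_arm` and `update`, the layer sizes, the learning rate, and the choice to score each arm by simply adding the two network outputs are all assumptions made for illustration. What it does take from the text is that the exploration network receives the gradient of the exploitation network as input and is trained on the residual between the observed reward and the exploitation estimate.

```python
import random

def relu(z):
    return z if z > 0.0 else 0.0

class TinyNet:
    """Hypothetical one-hidden-layer ReLU net with scalar output: f(x) = v . relu(W x)."""

    def __init__(self, in_dim, hidden, rng):
        s = 1.0 / (in_dim ** 0.5)
        self.W = [[rng.gauss(0.0, s) for _ in range(in_dim)] for _ in range(hidden)]
        self.v = [rng.gauss(0.0, 1.0 / hidden ** 0.5) for _ in range(hidden)]

    def forward(self, x):
        # Returns the scalar output and the hidden pre-activations.
        pre = [sum(w * xi for w, xi in zip(row, x)) for row in self.W]
        return sum(vj * relu(p) for vj, p in zip(self.v, pre)), pre

    def grad_params(self, x):
        """Flattened gradient of f(x) with respect to the parameters (W, then v)."""
        _, pre = self.forward(x)
        g = []
        for vj, p in zip(self.v, pre):
            active = 1.0 if p > 0.0 else 0.0
            g.extend(vj * active * xi for xi in x)  # df/dW[j][i]
        g.extend(relu(p) for p in pre)              # df/dv[j]
        return g

    def sgd_step(self, x, target, lr):
        """One gradient step on the squared loss 0.5 * (f(x) - target)^2."""
        out, _ = self.forward(x)
        err = out - target
        g = self.grad_params(x)
        k = 0
        for j in range(len(self.W)):
            for i in range(len(x)):
                self.W[j][i] -= lr * err * g[k]
                k += 1
        for j in range(len(self.v)):
            self.v[j] -= lr * err * g[k]
            k += 1

def select_arm(f1, f2, arms):
    """Score each arm as f1(x) + f2(grad f1(x)); the plain sum is an assumption."""
    scores = []
    for x in arms:
        exploit = f1.forward(x)[0]
        explore = f2.forward(f1.grad_params(x))[0]
        scores.append(exploit + explore)
    return max(range(len(arms)), key=scores.__getitem__), scores

def update(f1, f2, x, reward, lr=0.01):
    """Train f2 on the residual reward - f1(x), then train f1 on the reward."""
    g = f1.grad_params(x)                  # gradient at the parameters used to predict
    residual = reward - f1.forward(x)[0]   # signed potential-gain label
    f2.sgd_step(g, residual, lr)
    f1.sgd_step(x, reward, lr)

rng = random.Random(0)
d, hidden = 4, 8
f1 = TinyNet(d, hidden, rng)
f2 = TinyNet(d * hidden + hidden, hidden, rng)  # input size = number of f1 parameters
arms = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(3)]
chosen, scores = select_arm(f1, f2, arms)
update(f1, f2, arms[chosen], reward=1.0)
```

Because the label for the exploration network is signed, its output can raise or lower an arm's score, which is one way to read the "upward" and "downward" exploration mentioned in the TL;DR.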

Paper Structure (19 sections, 19 theorems, 78 equations, 6 figures, 5 tables, 1 algorithm)

Key Result

Theorem 1

For any $\delta \in (0, 1)$ and $R > 0$, suppose $m \geq \Omega\left( \text{poly}(T, L, R, n, \log(1/\delta)) \right)$ and $\eta_1 = \eta_2 = \frac{R^2}{\sqrt{m}}$. Then, with probability at least $1 - \delta$ over the initialization, the pseudo regret of Algorithm 1 in $T$ rounds satisfies

Figures (6)

  • Figure 1: Exploration direction (right side): (1) "Upward" exploration should be performed when the model underestimates the arm's reward; (2) "Downward" exploration should be performed when the model overestimates the arm's reward. EE-Net (the proposed strategy), depicted on the left side, explores adaptively according to the estimated potential gain of each arm.
  • Figure 2: With the same exploitation network $f_1$, EE-Net outperforms neural-based baselines.
  • Figure 3: Ablation study on label function $y$ for $f_2$. EE-Net denotes $y_1 = r - f_1$, EE-Net-abs denotes $y_2= | r - f_1|$, and EE-Net-ReLU denotes $y_3 = \text{ReLU} (r- f_1)$. EE-Net shows the best performance on these two datasets.
  • Figure 4: Decision-making time
  • Figure 5: Extended rounds on Movielens and MNIST Datasets
  • ...and 1 more figure
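The three label functions compared in the Figure 3 ablation can be written out directly. The short sketch below uses hypothetical function names and made-up reward values; only the three formulas come from the caption. It shows why the signed label is the only one of the three that can signal "downward" exploration when $f_1$ overestimates the reward.

```python
def label_signed(r, f1_out):
    """y_1 = r - f_1: negative when f_1 overestimates the reward."""
    return r - f1_out

def label_abs(r, f1_out):
    """y_2 = |r - f_1|: keeps the size of the error but drops its direction."""
    return abs(r - f1_out)

def label_relu(r, f1_out):
    """y_3 = ReLU(r - f_1): keeps only underestimation, zero otherwise."""
    return max(r - f1_out, 0.0)

# Illustrative case: f_1 predicts 0.75 but the observed reward is 0.25.
overestimate = (
    label_signed(0.25, 0.75),  # -0.5
    label_abs(0.25, 0.75),     # 0.5
    label_relu(0.25, 0.75),    # 0.0
)
```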

Theorems & Definitions (43)

  • Definition 4.1: Potential Gain
  • Remark 4.1: Network structure
  • Remark 4.2: Exploration direction
  • Remark 4.3: Space complexity
  • Theorem 1
  • Remark 5.1
  • Remark 5.2
  • Lemma 5.1
  • Definition 5.1: NTK (ntk2018neural; wang2021neural)
  • Lemma 5.2
  • ...and 33 more