Neural Exploitation and Exploration of Contextual Bandits
Yikun Ban, Yuchen Yan, Arindam Banerjee, Jingrui He
TL;DR
This work tackles the exploration-exploitation dilemma in contextual bandits with non-linear rewards by introducing EE-Net, a dual-network framework consisting of an exploitation network f1 and an exploration network f2. The key idea is to use the gradient of f1 as input to f2, which learns the residual 'potential gain' h(x) − f1(x) to drive adaptive upward or downward exploration. The authors establish an instance-dependent regret bound of $\tilde{O}(\sqrt{T})$ under over-parameterization and show that EE-Net outperforms both linear and neural baselines on multiple real-world datasets, with favorable inference-time characteristics. The approach integrates neural representation learning for both reward estimation and exploration, offering a scalable and effective strategy for complex, non-linear bandit problems.
Abstract
In this paper, we study the use of neural networks for exploitation and exploration in contextual multi-armed bandits. Contextual multi-armed bandits have been studied for decades with various applications. To solve the exploitation-exploration trade-off in bandits, there are three main techniques: epsilon-greedy, Thompson Sampling (TS), and Upper Confidence Bound (UCB). In recent literature, a series of neural bandit algorithms have been proposed to adapt to the non-linear reward function, combined with TS or UCB strategies for exploration. Instead of calculating a large-deviation-based statistical bound for exploration like previous methods, we propose EE-Net, a novel neural-based exploitation and exploration strategy. In addition to using a neural network (Exploitation network) to learn the reward function, EE-Net uses another neural network (Exploration network) to adaptively learn the potential gains compared to the currently estimated reward for exploration. We provide an instance-based $\widetilde{\mathcal{O}}(\sqrt{T})$ regret upper bound for EE-Net and show that EE-Net outperforms related linear and neural contextual bandit baselines on real-world datasets.
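The two-network idea can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the class `TinyNet`, the helper names, the network widths, the learning rate, the gradient normalization, the arm-scoring rule f1(x) + f2(∇f1(x)), and the synthetic reward `true_reward` are all assumptions chosen for illustration. The sketch only shows the mechanism described above: an exploitation network f1 is fit to observed rewards, and an exploration network f2 takes the gradient of f1 as input and is fit to the residual r − f1(x).

```python
import math
import random


class TinyNet:
    """One-hidden-layer ReLU network f(x) = v . relu(W x), trained by plain SGD."""

    def __init__(self, dim, width, rng):
        self.W = [[rng.gauss(0, 1) / math.sqrt(dim) for _ in range(dim)]
                  for _ in range(width)]
        self.v = [rng.gauss(0, 1) / math.sqrt(width) for _ in range(width)]

    def _pre(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.W]

    def predict(self, x):
        return sum(vj * max(p, 0.0) for vj, p in zip(self.v, self._pre(x)))

    def grad(self, x):
        """Flattened gradient of f(x) w.r.t. the parameters: all of W, then v."""
        pre = self._pre(x)
        g = []
        for vj, p in zip(self.v, pre):
            g.extend(vj * xi if p > 0 else 0.0 for xi in x)
        g.extend(max(p, 0.0) for p in pre)
        return g

    def sgd_step(self, x, target, lr):
        """One SGD step on the squared loss 0.5 * (f(x) - target)^2."""
        err = self.predict(x) - target
        g = self.grad(x)
        d = len(x)
        for j, row in enumerate(self.W):
            for k in range(d):
                row[k] -= lr * err * g[j * d + k]
        offset = len(self.W) * d
        for j in range(len(self.v)):
            self.v[j] -= lr * err * g[offset + j]


def normalize(g):
    """Scale the gradient to unit L2 norm (an assumption made for stability)."""
    norm = math.sqrt(sum(gi * gi for gi in g))
    return [gi / norm for gi in g] if norm > 0 else g


def true_reward(x):
    """Hypothetical non-linear reward h(x), used only to simulate feedback."""
    return (0.5 * x[0] - 0.3 * x[1] + 0.8 * x[2]) ** 2


def select_arm(f1, f2, arms):
    """Score each arm by exploitation plus learned exploration, pick the best."""
    scores = [f1.predict(x) + f2.predict(normalize(f1.grad(x))) for x in arms]
    return max(range(len(arms)), key=lambda i: scores[i])


def run(rounds=200, n_arms=4, dim=3, width=8, lr=0.02, seed=0):
    rng = random.Random(seed)
    f1 = TinyNet(dim, width, rng)                  # exploitation network
    f2 = TinyNet(dim * width + width, width, rng)  # exploration network
    picks = []
    for _ in range(rounds):
        arms = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_arms)]
        a = select_arm(f1, f2, arms)
        x = arms[a]
        reward = true_reward(x) + rng.gauss(0, 0.1)
        # f2 learns the potential gain r - f1(x) from the gradient of f1,
        # both computed before f1 is updated on this round's feedback.
        g = normalize(f1.grad(x))
        f2.sgd_step(g, reward - f1.predict(x), lr)
        f1.sgd_step(x, reward, lr)
        picks.append(a)
    return picks
```

Calling `run(rounds=200, seed=0)` returns the list of arm indices chosen in each round. Because f2 is trained on a signed residual, its output can raise or lower an arm's score, which is how the sketch mirrors the upward and downward exploration described in the TL;DR.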
