Stochastic Bandits with ReLU Neural Networks

Kan Xu, Hamsa Bastani, Surbhi Goel, Osbert Bastani

TL;DR

This work addresses stochastic bandits where the reward is modeled by a one-layer ReLU neural network, and shows that a two-phase approach—exploration to enter a linear regime followed by a transformed-feature linear bandit—achieves a minimax regret of $\tilde{O}(\sqrt{T})$. By leveraging the piecewise linear structure, the authors convert the problem to a linear bandit in a higher-dimensional feature space and design OFU-ReLU and its batching variant OFU-ReLU+ to remove dependence on unknown parameters. They establish a parameter-estimation bound for the neurons (up to sign) and connect small generalization error to accurate neuron recovery, enabling online learning with provable guarantees. Empirical results on synthetic ReLU-bandits demonstrate notable improvements over linear OFUL and NeuralUCB baselines, suggesting practical potential for ReLU-based bandit algorithms in limited-time regimes.
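The conversion to a linear bandit can be illustrated with a minimal NumPy sketch. It assumes a reward of the form $f(x)=\sum_{i=1}^{k}\max(0,\theta_i^\top x)$, which may differ from the paper's exact parameterization, and the names `relu_reward` and `transformed_features` are hypothetical. Once the estimates $\hat\theta_i$ agree with the true neurons on the sign of $\theta_i^\top x$, the reward is exactly linear in a $kd$-dimensional feature vector built from gated copies of $x$.

```python
import numpy as np

def relu_reward(theta, x):
    """Assumed reward model: sum of k ReLU neurons; theta has shape (k, d)."""
    return float(np.maximum(theta @ x, 0.0).sum())

def transformed_features(theta_hat, x):
    """Stack k gated copies of x: block i is x if theta_hat[i] @ x >= 0, else 0."""
    gates = (theta_hat @ x >= 0.0).astype(float)    # shape (k,)
    return (gates[:, None] * x[None, :]).ravel()    # shape (k * d,)

rng = np.random.default_rng(0)
k, d = 3, 5
theta_star = rng.normal(size=(k, d))                      # unknown true neurons
theta_hat = theta_star + 1e-6 * rng.normal(size=(k, d))   # accurate estimates
x = rng.normal(size=d)
x /= np.linalg.norm(x)                                    # action on the unit sphere

phi = transformed_features(theta_hat, x)
# With matching activation patterns, the reward is linear in phi.
linear_value = float(theta_star.ravel() @ phi)
```

The gating uses only the estimated neurons, so the feature map can be computed without knowing the true parameters; the unknown vector in the resulting linear bandit is the stacked $\theta^*$.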

Abstract

We study the stochastic bandit problem with ReLU neural network structure. We show that a $\tilde{O}(\sqrt{T})$ regret guarantee is achievable by considering bandits with one-layer ReLU neural networks; to the best of our knowledge, our work is the first to achieve such a guarantee. In this specific setting, we propose an OFU-ReLU algorithm that can achieve this upper bound. The algorithm first explores randomly until it reaches a linear regime, and then implements a UCB-type linear bandit algorithm to balance exploration and exploitation. Our key insight is that we can exploit the piecewise linear structure of ReLU activations and convert the problem into a linear bandit in a transformed feature space, once we learn the parameters of ReLU relatively accurately during the exploration stage. To remove dependence on model parameters, we design an OFU-ReLU+ algorithm based on a batching strategy, which can provide the same theoretical guarantee.
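The UCB-type second phase can be sketched as a generic ridge-regression UCB learner that operates on whatever feature vectors it is given. This is a minimal illustration in the spirit of OFUL, not the paper's OFU-ReLU algorithm: the class name `LinearUCB`, the fixed confidence width `beta`, and the regularizer `reg` are assumptions made for the example.

```python
import numpy as np

class LinearUCB:
    """Ridge-regression UCB on fixed-dimensional features (OFUL-style sketch)."""

    def __init__(self, dim, reg=1.0, beta=1.0):
        self.V = reg * np.eye(dim)   # regularized Gram matrix
        self.b = np.zeros(dim)       # sum of reward-weighted features
        self.beta = beta             # confidence width (hypothetical constant)

    def select(self, features):
        """features: array (n_actions, dim); returns the index with the largest UCB."""
        V_inv = np.linalg.inv(self.V)
        theta_hat = V_inv @ self.b
        means = features @ theta_hat
        # Exploration bonus sqrt(x^T V^{-1} x) for every candidate action x.
        bonus = np.sqrt(np.einsum("ij,jk,ik->i", features, V_inv, features))
        return int(np.argmax(means + self.beta * bonus))

    def update(self, feature, reward):
        """Record one observed (feature, reward) pair."""
        self.V += np.outer(feature, feature)
        self.b += reward * feature

# Toy usage: action 0 always pays 1, action 1 always pays 0.
bandit = LinearUCB(dim=2)
actions = np.eye(2)
for _ in range(50):
    bandit.update(actions[0], 1.0)
    bandit.update(actions[1], 0.0)
chosen = bandit.select(actions)
```

In the two-phase scheme described in the abstract, a learner of this kind would be run only after the random exploration rounds, with the transformed features as input.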

Paper Structure

This paper contains 31 sections, 14 theorems, 116 equations, 4 figures, 2 algorithms.

Key Result

Proposition 3.1

Suppose [...] for some $\eta\in\mathbb{R}_{>0}$. Then there exists a bijection $\sigma:[k]\to[k]$ such that [...], where $h(\eta,\epsilon)\coloneqq\frac{k\epsilon^3|S^{d-3}|/2}{\epsilon^2(1-d\epsilon^2/2)|S^{d-2}|/8-\eta-6kd\epsilon^3|S^{d-2}|}$.
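To get a feel for the size of $h(\eta,\epsilon)$, the expression can be evaluated numerically. The sketch below assumes that $|S^{n}|$ denotes the surface area of the unit $n$-sphere in $\mathbb{R}^{n+1}$, which is an interpretation and not stated in the text above; the function names are hypothetical, and $d\ge 3$ is required so that $S^{d-3}$ is defined.

```python
import math

def sphere_area(n):
    """Surface area of the unit n-sphere S^n in R^(n+1) (assumed meaning of |S^n|)."""
    return 2.0 * math.pi ** ((n + 1) / 2.0) / math.gamma((n + 1) / 2.0)

def h(eta, eps, k, d):
    """Evaluate h(eta, eps) as written in Proposition 3.1; requires d >= 3."""
    numerator = k * eps**3 * sphere_area(d - 3) / 2.0
    denominator = (
        eps**2 * (1.0 - d * eps**2 / 2.0) * sphere_area(d - 2) / 8.0
        - eta
        - 6.0 * k * d * eps**3 * sphere_area(d - 2)
    )
    if denominator <= 0.0:
        # The expression is not a positive bound for these inputs.
        raise ValueError("denominator of h is not positive")
    return numerator / denominator

# Example with k = 2 neurons in dimension d = 3 and zero generalization error.
value = h(eta=0.0, eps=1e-3, k=2, d=3)
```

Because $\eta$ is subtracted in the denominator, increasing $\eta$ with the other arguments fixed increases $h$ as long as the denominator stays positive.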

Figures (4)

  • Figure 1: Schematic representation of OFU-ReLU+.
  • Figure 2: Cumulative regret over a time horizon of $T=1{,}000$, across 50 trials, with 95% confidence intervals.
  • Figure 3: Illustrations for proof sketch of the estimation error for ReLU Neural Networks. (a) The region $X'$ is the cylinder with caps consisting of the two green circles and radius $2\epsilon$. (b) Projected version of subfigure (a). (c) The green region is $X'$ with a section of length $O(\epsilon/\alpha)$ cut out.
  • Figure 4: Illustrations for proof sketch of the optimal action gap $\nu_*$.

Theorems & Definitions (20)

  • Proposition 3.1
  • Theorem 3.2
  • Definition 4.1
  • Proposition 4.2
  • Theorem 4.3
  • Proof Sketch
  • Theorem 4.4
  • Proof Sketch
  • Lemma 1.1
  • Lemma 1.2
  • ...and 10 more