Stochastic Bandits with ReLU Neural Networks

Kan Xu; Hamsa Bastani; Surbhi Goel; Osbert Bastani

TL;DR

This work studies stochastic bandits in which the reward is modeled by a one-layer ReLU neural network. It shows that a two-phase approach (random exploration until a linear regime is reached, followed by a linear bandit on transformed features) achieves $\tilde{O}(\sqrt{T})$ regret. By leveraging the piecewise linear structure of ReLU activations, the authors convert the problem into a linear bandit in a higher-dimensional feature space. They design OFU-ReLU, along with a batching variant, OFU-ReLU+, that removes the dependence on unknown model parameters. They establish a parameter-estimation bound for the neurons (up to sign) and connect small generalization error to accurate neuron recovery, enabling online learning with provable guarantees. Empirical results on synthetic ReLU bandits show improvements over linear OFUL and NeuralUCB baselines, suggesting practical potential for ReLU-based bandit algorithms in limited-time regimes.

Abstract

We study the stochastic bandit problem with ReLU neural network structure. We show that a $\tilde{O}(\sqrt{T})$ regret guarantee is achievable by considering bandits with one-layer ReLU neural networks; to the best of our knowledge, our work is the first to achieve such a guarantee. In this specific setting, we propose an OFU-ReLU algorithm that can achieve this upper bound. The algorithm first explores randomly until it reaches a linear regime, and then implements a UCB-type linear bandit algorithm to balance exploration and exploitation. Our key insight is that we can exploit the piecewise linear structure of ReLU activations and convert the problem into a linear bandit in a transformed feature space, once we learn the parameters of ReLU relatively accurately during the exploration stage. To remove dependence on model parameters, we design an OFU-ReLU+ algorithm based on a batching strategy, which can provide the same theoretical guarantee.
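
The conversion to a linear bandit can be illustrated with a minimal sketch. It assumes the reward model is $f(x)=\sum_{i=1}^{k}\max(0,\theta_i^\top x)$ and that an estimate of the neurons from the exploration stage is good enough to predict which neurons are active. The names `relu_reward`, `transformed_features`, and `ofu_phase` are hypothetical, the exploration stage is replaced by a small perturbation of the true parameters, and the UCB loop is a generic ridge-regression linear bandit rather than the paper's exact OFU-ReLU procedure.

```python
import numpy as np

def relu_reward(theta, x):
    """Expected reward of an assumed one-layer ReLU model: sum_i max(0, theta_i . x)."""
    return float(np.maximum(theta @ x, 0.0).sum())

def transformed_features(theta_est, x):
    """Map x in R^d to phi(x) in R^(k*d): block i equals x if estimated neuron i is active, else 0."""
    active = (theta_est @ x > 0).astype(float)      # estimated activation pattern, shape (k,)
    return (active[:, None] * x[None, :]).ravel()   # shape (k*d,)

def ofu_phase(theta, theta_est, actions, T, lam=1.0, beta=1.0, noise=0.1, seed=0):
    """Generic UCB-type linear bandit on the transformed features (illustrative only)."""
    rng = np.random.default_rng(seed)
    feats = np.array([transformed_features(theta_est, a) for a in actions])
    p = feats.shape[1]
    V = lam * np.eye(p)          # regularized design matrix
    b = np.zeros(p)              # running sum of reward-weighted features
    best = max(relu_reward(theta, a) for a in actions)
    regret = 0.0
    for _ in range(T):
        w_hat = np.linalg.solve(V, b)                       # ridge estimate
        V_inv = np.linalg.inv(V)
        quad = np.einsum("ij,jk,ik->i", feats, V_inv, feats)
        bonus = np.sqrt(np.maximum(quad, 0.0))              # exploration bonus per action
        idx = int(np.argmax(feats @ w_hat + beta * bonus))  # optimistic action choice
        mean = relu_reward(theta, actions[idx])
        r = mean + noise * rng.normal()                     # noisy observed reward
        V += np.outer(feats[idx], feats[idx])
        b += r * feats[idx]
        regret += best - mean
    return regret

rng = np.random.default_rng(0)
k, d = 3, 4
theta = rng.normal(size=(k, d))                     # true neurons (unknown to the learner)
actions = rng.normal(size=(20, d))
actions /= np.linalg.norm(actions, axis=1, keepdims=True)

# With the exact activation pattern, the ReLU reward is linear in phi(x).
x = actions[0]
phi_exact = transformed_features(theta, x)
gap = abs(float(theta.ravel() @ phi_exact) - relu_reward(theta, x))

# Stand-in for the exploration-stage estimate: a small perturbation of theta.
theta_est = theta + 1e-3 * rng.normal(size=(k, d))
regret = ofu_phase(theta, theta_est, actions, T=200)
print(f"identity gap: {gap:.2e}, cumulative regret: {regret:.3f}")
```

The identity check uses the exact activation pattern, where the transformed-feature model reproduces the ReLU reward. With an estimated pattern, the two agree only for actions that are not too close to a neuron's decision boundary, which is one reason the accuracy of the exploration-stage estimate matters.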

Paper Structure (31 sections, 14 theorems, 116 equations, 4 figures, 2 algorithms)

Introduction
Other Related Work
Problem Formulation
Parameter Estimation for ReLU Neural Networks
Algorithms for ReLU Bandits
Algorithm Design
OFU-ReLU Algorithm
OFU-ReLU+ Algorithm
Experiments
Conclusion
Proof of Proposition \ref{prop:key}
Intuition
Proof of Proposition \ref{prop:key}
Step 1.
Step 2.
...and 16 more sections

Key Result

Proposition 3.1

Suppose […] for some $\eta\in\mathbb{R}_{>0}$. Then there exists a bijection $\sigma:[k]\to[k]$ such that […], where $h(\eta,\epsilon)\coloneqq\frac{k\epsilon^3|S^{d-3}|/2}{\epsilon^2(1-d\epsilon^2/2)|S^{d-2}|/8-\eta-6kd\epsilon^3|S^{d-2}|}$.

Figures (4)

Figure 1: Schematic representation of OFU-ReLU+.
Figure 2: Cumulative regret over a time horizon of $T=1{,}000$ across 50 trials, with 95% confidence intervals.
Figure 3: Illustrations for proof sketch of the estimation error for ReLU Neural Networks. (a) The region $X'$ is the cylinder with caps consisting of the two green circles and radius $2\epsilon$. (b) Projected version of subfigure (a). (c) The green region is $X'$ with a section of length $O(\epsilon/\alpha)$ cut out.
Figure 4: Illustrations for proof sketch of the optimal action gap $\nu_*$.

Theorems & Definitions (20)

Proposition 3.1
Theorem 3.2
Definition 4.1
Proposition 4.2
Theorem 4.3
Proof Sketch
Theorem 4.4
Proof Sketch
Lemma 1.1
Lemma 1.2
...and 10 more
