Contextual Bandit Optimization with Pre-Trained Neural Networks

Mikhail Terekhov

TL;DR

The paper develops a theory for leveraging pre-trained neural representations in contextual bandits. It proposes the Explore Twice then Commit (E2TC) algorithm, which first uses Ridge regression to estimate the last layer on top of a pre-trained representation and then runs SGD jointly on all the weights. It proves regret bounds under local convexity and realizability, yielding a sublinear regime $\widetilde{O}((KT)^{4/5})$ when $d=O(T^{2/5})$ and $K=O(T^{1/5})$, and also provides regret bounds with a linear misspecification term for the weak-learning setting, where only the last layer is learned. The work combines fixed- and random-design Ridge analyses with high-probability SGD guarantees, plus preconditioning, to translate the quality of the two-stage exploration into a bound on overall regret. Experiments on online MNIST and wine-quality datasets illustrate practical gains from pre-training and the behavior of E2TC under realistic conditions. Overall, the results offer a principled pathway to incorporate pre-trained representations into neural contextual bandits, with concrete guidance on hyperparameter tuning and on the regimes where sublinear regret can be achieved.
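
As a quick sanity check on the sublinear regime quoted above, the exponent can be worked out directly. This is only arithmetic on the stated rate under the stated condition on $K$; constants, logarithmic factors, and the way $d$ enters the bound are not rederived here.

```latex
% Assumes K = O(T^{1/5}), as stated in the TL;DR; logarithmic factors are ignored.
\[
  (KT)^{4/5}
  = O\!\left( \left(T^{1/5} \cdot T\right)^{4/5} \right)
  = O\!\left( T^{\frac{6}{5}\cdot\frac{4}{5}} \right)
  = O\!\left( T^{24/25} \right),
  \qquad
  \frac{T^{24/25}}{T} = T^{-1/25} \to 0 \ \text{ as } T \to \infty .
\]
```

Under this condition the rate is sublinear in $T$, although the exponent $24/25$ is close to $1$.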

Abstract

Bandit optimization is a difficult problem, especially if the reward model is high-dimensional. When rewards are modeled by neural networks, sublinear regret has only been shown under strong assumptions, usually when the network is extremely wide. In this thesis, we investigate how pre-training can help us in the regime of smaller models. We consider a stochastic contextual bandit with the rewards modeled by a multi-layer neural network. The last layer is a linear predictor, and the layers before it are a black-box neural architecture, which we call a representation network. We model pre-training as an initial guess of the weights of the representation network provided to the learner. To leverage the pre-trained weights, we introduce a novel algorithm we call Explore Twice then Commit (E2TC). During its two stages of exploration, the algorithm first estimates the last layer's weights using Ridge regression, and then runs Stochastic Gradient Descent (SGD) jointly on all the weights. For a locally convex loss function, we provide conditions on the pre-trained weights under which the algorithm can learn efficiently. Under these conditions, we show sublinear regret of E2TC when the dimension of the last layer $d$ and the number of actions $K$ are much smaller than the horizon $T$. In the weak training regime, when only the last layer is learned, the problem reduces to a misspecified linear bandit. We introduce a measure of misspecification $\varepsilon_0$ for this bandit and use it to provide bounds $O(\varepsilon_0\sqrt{d}KT+(KT)^{4/5})$ or $\tilde{O}(\varepsilon_0\sqrt{d}KT+d^{1/3}(KT)^{2/3})$ on the regret, depending on regularization strength. The first of these bounds has a dimension-independent sublinear term, made possible by the stochasticity of contexts. We also run experiments to evaluate the regret of E2TC and the sample complexity of its exploration in practice.
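
The three stages described in the abstract (Ridge regression for the last layer, joint SGD on all weights, then commitment) can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: the representation network is a single tanh layer, actions are drawn uniformly during both exploration phases, one SGD step is taken per round with a fixed step size, and preconditioning is omitted. The names `features`, `e2tc`, `reward_fn`, and the `(T, K, m)` context encoding are hypothetical.

```python
import numpy as np

def features(theta, z):
    """Toy representation network: one tanh layer (a stand-in for the black box)."""
    return np.tanh(theta @ z)

def e2tc(contexts, reward_fn, theta0, T1, T2, lam=1.0, lr=0.01, seed=0):
    """Sketch of Explore Twice then Commit (assumes T1 + T2 <= T).

    contexts:  array (T, K, m), one m-dimensional input per action and round.
    reward_fn: reward_fn(t, a) returns the observed reward of action a in round t.
    theta0:    pre-trained representation weights, shape (d, m).
    """
    rng = np.random.default_rng(seed)
    T, K, _ = contexts.shape
    d = theta0.shape[0]
    theta = theta0.copy()
    actions = np.empty(T, dtype=int)

    # Exploration phase 1: uniform actions, then Ridge regression for the last layer
    # on top of the fixed pre-trained features.
    Phi, y = np.empty((T1, d)), np.empty(T1)
    for t in range(T1):
        a = rng.integers(K)
        actions[t] = a
        Phi[t] = features(theta, contexts[t, a])
        y[t] = reward_fn(t, a)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)

    # Exploration phase 2: uniform actions, one joint SGD step on (w, theta) per round
    # for the squared loss 0.5 * (w . tanh(theta z) - r)^2.
    for t in range(T1, T1 + T2):
        a = rng.integers(K)
        actions[t] = a
        z = contexts[t, a]
        h = features(theta, z)
        err = w @ h - reward_fn(t, a)
        grad_w = err * h
        grad_theta = np.outer(err * w * (1.0 - h**2), z)
        w = w - lr * grad_w
        theta = theta - lr * grad_theta

    # Commit phase: act greedily with respect to the learned reward model.
    for t in range(T1 + T2, T):
        preds = np.array([w @ features(theta, contexts[t, a]) for a in range(K)])
        actions[t] = int(np.argmax(preds))
    return actions, w, theta
```

The split of the horizon into `T1`, `T2`, and the commit phase is left to the caller here; how to choose these lengths is part of what the regret analysis addresses.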

Paper Structure (32 sections, 19 theorems, 293 equations, 4 figures, 1 algorithm)

Introduction
Related Work
Bandits and optimization
Problem Formulation
Algorithm
Low risk to low regret
Empirical risk minimization
Linear estimation of the last layer
Fixed design Ridge regression
Random design Ridge regression
Strong convexity makes weak learning achieve sublinear regret
High-probability guarantees for stochastic gradient descent
Globally convex loss
Locally convex loss
Preconditioning in E2TC
...and 17 more sections

Key Result

Theorem 2.1

Let E2TC be run on a stochastic contextual bandit with pre-training. Let Assumptions (as:context), (as:realizability), and (as:bound) hold. Assume also that the weights $(\overline{\mathbf{w}},\overline{\theta})$ after the two exploration phases satisfy the following bound with probability at least $1-\delta$: Then the regret of E2TC will be bounded by

Figures (4)

Figure 1: Distributions of spectra of $\widehat{\Sigma(\theta_0)}$ with and without the regularization of equation (eq:cos_reg).
Figure 2: Sample images $I_k$ and encoded-decoded counterparts $\psi_{\tilde{\theta}}(\varphi_\theta(X_k))$. For each example, we selected the first data point with the corresponding class from the dataset. In the top row, we present digits 0-4 from the validation set used to evaluate the prediction quality of the pre-trained model. In the bottom row, we show how the autoencoder generalizes to unseen digits 5-9.
Figure 3: Regret of several bandit algorithms on the online MNIST classification task. For all algorithms we show the cumulative regret (number of misclassified digits so far) averaged over $20$ runs. For each curve we also show the empirical standard deviation. Since E2TC is not an anytime algorithm, we chose $10$ horizons for it and reran the algorithm for each horizon with a different $T_2$.
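
The curves described in the Figure 3 caption can be computed from per-round mistake indicators as sketched below. The function name `cumulative_regret_stats` and the `(runs, T)` input layout are assumptions for illustration, as is the use of the sample standard deviation (`ddof=1`); the caption does not say which estimator was used.

```python
import numpy as np

def cumulative_regret_stats(mistakes):
    """Mean and standard deviation of cumulative regret across runs.

    mistakes: array-like of shape (runs, T) with 1 where the chosen digit
              was wrong in that round and 0 otherwise (hypothetical layout).
    """
    mistakes = np.asarray(mistakes, dtype=float)
    cum = np.cumsum(mistakes, axis=1)          # misclassified digits so far, per run
    return cum.mean(axis=0), cum.std(axis=0, ddof=1)
```

For an algorithm that is not anytime, such as E2TC, each horizon would need its own set of runs, and only the final value of each curve would be comparable with the anytime baselines.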
Figure 4: Mean squared error on the test data for E2TC training and various ablations, averaged over $100$ training runs with random permutations of the training data. To de-clutter the plot, we split it into two canvases. Dashed curves show standard deviations. Horizontal dotted lines show the performance of two non-neural-network baselines: predicting the mean of the training data and running kernelized Support Vector Regression with an RBF kernel on it.
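
The simpler of the two non-neural baselines in the Figure 4 caption, predicting the mean of the training targets, can be written in a few lines. The function name `mean_baseline_mse` is hypothetical, and this sketch does not cover the Support Vector Regression baseline.

```python
import numpy as np

def mean_baseline_mse(y_train, y_test):
    """Test MSE of a constant predictor equal to the mean of the training targets."""
    prediction = np.mean(y_train)
    residuals = np.asarray(y_test, dtype=float) - prediction
    return float(np.mean(residuals ** 2))
```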

Theorems & Definitions (31)

Theorem 2.1
Proof 1: Proof of Theorem 2.1 (thm:risktoregret)
Lemma 3.1: The cost of misspecification
Proof 2
Lemma 3.2: Noise in well-specified linear bandits
Lemma 3.3: Regularization error in presence of misspecification
Proof 3
Lemma 3.4: Regularization error w.r.t. $\mathbf{w}^*$
Proof 4
Lemma 3.5: Bounded misspecification
...and 21 more
