Contextual Bandit Optimization with Pre-Trained Neural Networks
Mikhail Terekhov
TL;DR
The paper develops a theory for leveraging pre-trained neural representations in contextual bandits. It proposes the Explore Twice then Commit (E2TC) algorithm, which first estimates the last layer with Ridge regression on top of the pre-trained representation and then fine-tunes all the weights jointly with SGD. It proves regret bounds under local convexity and realizability, yielding sublinear regret $\widetilde{O}((KT)^{4/5})$ when $d=O(T^{2/5})$ and $K=O(T^{1/5})$. For the weak-training setting, in which only the last layer is learned, it gives bounds whose linear term scales with a misspecification measure $\varepsilon_0$. The analysis combines fixed- and random-design Ridge regression results with high-probability SGD guarantees and preconditioning to translate the quality of the two exploration stages into overall regret. Experiments on online MNIST and wine-quality datasets illustrate the practical gains from pre-training and the behavior of E2TC under realistic conditions. Overall, the results offer a principled way to incorporate pre-trained representations into neural contextual bandits, with concrete guidance on hyperparameter tuning and on the regimes in which sublinear regret can be achieved.
Abstract
Bandit optimization is a difficult problem, especially if the reward model is high-dimensional. When rewards are modeled by neural networks, sublinear regret has only been shown under strong assumptions, usually when the network is extremely wide. In this thesis, we investigate how pre-training can help in the regime of smaller models. We consider a stochastic contextual bandit with the rewards modeled by a multi-layer neural network. The last layer is a linear predictor, and the layers before it are a black-box neural architecture, which we call a representation network. We model pre-training as an initial guess of the weights of the representation network provided to the learner. To leverage the pre-trained weights, we introduce a novel algorithm we call Explore Twice then Commit (E2TC). During its two stages of exploration, the algorithm first estimates the last layer's weights using Ridge regression, and then runs Stochastic Gradient Descent jointly on all the weights. For a locally convex loss function, we provide conditions on the pre-trained weights under which the algorithm can learn efficiently. Under these conditions, we show sublinear regret of E2TC when the dimension $d$ of the last layer and the number of actions $K$ are much smaller than the horizon $T$. In the weak training regime, when only the last layer is learned, the problem reduces to a misspecified linear bandit. We introduce a measure of misspecification $\varepsilon_0$ for this bandit and use it to provide bounds $O(\varepsilon_0\sqrt{d}KT+(KT)^{4/5})$ or $\widetilde{O}(\varepsilon_0\sqrt{d}KT+d^{1/3}(KT)^{2/3})$ on the regret, depending on regularization strength. The first of these bounds has a dimension-independent sublinear term, made possible by the stochasticity of contexts. We also run experiments to evaluate the regret of E2TC and the sample complexity of its exploration in practice.
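To make the two exploration stages described above concrete, here is a minimal Python sketch of an explore-twice-then-commit loop. It is an illustration under stated assumptions, not the thesis's implementation: the representation network is replaced by a single hypothetical tanh layer, both exploration stages pick actions uniformly at random, the preconditioning step is omitted, and every name and hyperparameter (`features`, `e2tc_sketch`, `make_toy_env`, `T1`, `T2`, `lam`, `lr`) is invented for this example.

```python
import numpy as np

def features(W, X):
    """Stand-in representation network (assumed): one tanh layer per action.
    X has shape (K, p), one context vector per action; returns (K, d)."""
    return np.tanh(X @ W.T)

def ridge_last_layer(Phi, r, lam):
    """Closed-form Ridge estimate of the last-layer weights."""
    d = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ r)

def sgd_step(W, theta, x, r, lr):
    """One SGD step on 0.5 * (theta . tanh(W x) - r)^2, jointly in W and theta."""
    h = np.tanh(W @ x)
    err = theta @ h - r
    grad_theta = err * h
    grad_W = err * np.outer(theta * (1.0 - h ** 2), x)
    return W - lr * grad_W, theta - lr * grad_theta

def make_toy_env(W_star, theta_star, K, noise, seed):
    """Toy realizable environment (assumed): i.i.d. Gaussian contexts and
    rewards theta_star . tanh(W_star x) plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    p = W_star.shape[1]
    def step():
        X = rng.normal(size=(K, p))
        return X, features(W_star, X) @ theta_star + noise * rng.normal(size=K)
    return step

def e2tc_sketch(env_step, W0, K, T1, T2, T, lam=1.0, lr=0.005, seed=0):
    rng = np.random.default_rng(seed)
    W = W0.copy()
    Phi, obs, explore_total, commit_total = [], [], 0.0, 0.0
    # Stage 1: explore uniformly, keep the pre-trained weights frozen,
    # and fit only the last layer with Ridge regression.
    for _ in range(T1):
        X, rewards = env_step()
        a = rng.integers(K)
        Phi.append(features(W, X)[a])
        obs.append(rewards[a])
        explore_total += rewards[a]
    theta = ridge_last_layer(np.array(Phi), np.array(obs), lam)
    # Stage 2: explore uniformly again and run SGD jointly on all weights.
    for _ in range(T2):
        X, rewards = env_step()
        a = rng.integers(K)
        W, theta = sgd_step(W, theta, X[a], rewards[a], lr)
        explore_total += rewards[a]
    # Commit: act greedily with respect to the learned reward model.
    for _ in range(T - T1 - T2):
        X, rewards = env_step()
        a = int(np.argmax(features(W, X) @ theta))
        commit_total += rewards[a]
    return W, theta, explore_total / (T1 + T2), commit_total / (T - T1 - T2)

# Usage on the toy problem: the "pre-trained" weights W0 are a small
# perturbation of the true representation weights.
d, p, K = 4, 6, 5
W_star = np.eye(d, p)
theta_star = np.array([1.0, -1.0, 0.5, 2.0])
W0 = W_star + 0.05 * np.random.default_rng(1).normal(size=(d, p))
env = make_toy_env(W_star, theta_star, K, noise=0.1, seed=2)
W, theta, explore_avg, commit_avg = e2tc_sketch(env, W0, K, T1=300, T2=300, T=2000)
print(f"average reward while exploring: {explore_avg:.3f}")
print(f"average reward after committing: {commit_avg:.3f}")
```

In this toy setup the pre-trained weights start close to the truth, so the Ridge stage already gives a reasonable last layer and the SGD stage only refines it. The stage lengths `T1` and `T2` trade exploration cost against estimation quality; the values above are arbitrary and are not the tuned choices analyzed in the thesis.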
