Is Offline Decision Making Possible with Only Few Samples? Reliable Decisions in Data-Starved Bandits via Trust Region Enhancement

Ruiqi Zhang; Yuexiang Zhai; Andrea Zanette

TL;DR

This work tackles offline decision making under extreme data scarcity in stochastic multi-armed bandits (MABs). It introduces TRUST, a trust-region policy-optimization method that searches over stochastic policies around a data-driven reference policy, guided by a critical radius based on localized Gaussian complexity. Theoretical guarantees show that TRUST is comparable to LCB on minimax problems and can surpass it in data-starved regimes, and that it provides actionable lower bounds on performance. Empirical results cover both data-starved bandits and offline RL settings. The approach advances sample-efficient decision making by combining stochastic policies, localization concepts, and a principled trade-off between how far the search moves in policy space and estimation uncertainty. The framework also offers practical procedures for estimating the key complexity terms via Monte Carlo and discretization.

Abstract

What can an agent learn in a stochastic Multi-Armed Bandit (MAB) problem from a dataset that contains just a single sample for each arm? Surprisingly, in this work, we demonstrate that even in such a data-starved setting it may still be possible to find a policy competitive with the optimal one. This paves the way to reliable decision-making in settings where critical decisions must be made by relying only on a handful of samples. Our analysis reveals that stochastic policies can be substantially better than deterministic ones for offline decision-making. Focusing on offline multi-armed bandits, we design an algorithm called Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST) which is quite different from the predominant value-based lower confidence bound approach. Its design is enabled by localization laws, critical radii, and relative pessimism. We prove that its sample complexity is comparable to that of LCB on minimax problems while being substantially lower on problems with very few samples. Finally, we consider an application to offline reinforcement learning in the special case where the logging policies are known.
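The claim that stochastic policies can help with one sample per arm can be illustrated numerically. The sketch below is not the TRUST algorithm; it only compares how noisy the value estimate of a single arm is against the value estimate of the uniform stochastic policy, which averages all arms. The instance (uniformly drawn means, Gaussian noise), the function name `estimate_spread`, and all parameter values are assumptions chosen for illustration.

```python
import numpy as np

def estimate_spread(d=1000, sigma=1.0, trials=2000, seed=0):
    """Spread of value estimates built from one sample per arm.

    Returns the standard deviation (over independent datasets) of
    (a) the estimated value of a single fixed arm and
    (b) the estimated value of the uniform policy over all d arms.
    """
    rng = np.random.default_rng(seed)
    # Hypothetical bandit instance: true means drawn once, then held fixed.
    true_means = rng.uniform(0.0, 1.0, size=d)
    # Each row is one offline dataset containing a single noisy sample per arm.
    samples = true_means + sigma * rng.standard_normal((trials, d))
    single_arm_est = samples[:, 0]       # deterministic policy: always play arm 0
    uniform_est = samples.mean(axis=1)   # stochastic policy: play arms uniformly
    return single_arm_est.std(), uniform_est.std()

single_std, uniform_std = estimate_spread()
print(f"single arm std:     {single_std:.3f}")
print(f"uniform policy std: {uniform_std:.3f}")
```

Under these assumptions the single-arm estimate has standard deviation close to `sigma`, while the uniform policy's estimate has standard deviation close to `sigma / sqrt(d)`, because it averages `d` independent samples. A pessimistic method can therefore certify the value of a spread-out stochastic policy much more tightly than the value of any one arm.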

Paper Structure (35 sections, 12 theorems, 90 equations, 3 figures, 3 tables, 2 algorithms)

Introduction
Additional related work
Data-Starved Multi-Armed Bandits
Multi-armed bandits
Lower confidence bound algorithm
A data-starved MAB problem and failure of LCB
Can stochastic policies help?
Trust Region of Uncertainty for Stochastic policy enhancemenT (TRUST)
Decision variables
Trust region optimization
Trust region.
Critical radius.
Implementation details
Theoretical guarantees
Problem-dependent analysis
...and 20 more sections

Key Result

Theorem 3.1

Suppose the noise of arm $a_i$ is sub-Gaussian with proxy variance $\sigma_i^2.$ Let $\delta \in (0,1/2).$ Then, we have

Figures (3)

Figure 1: A simple diagram for the trust regions on a $3$-dim simplex. The central point is the reference (stochastic) policy, while red ellipses are trust regions around this reference policy.
Figure 2: The upper bound for the localized Gaussian width over a shifted simplex in dimension $d=10000$. The shifted simplex is $\left\{ \Delta \in \mathbb{R}^d: \sum_{i=1}^d \Delta_i = 0\right\}$. The two-stage upper bound we plot is based on Theorem 1 in Bellec (2019).
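A rough Monte Carlo sketch of a localized Gaussian width is given below. It is not the paper's procedure: it assumes the localization is a Euclidean ball of radius `eps` and uses only the sum-to-zero constraint from the caption, for which the supremum of $\langle g, \Delta \rangle$ has the closed form $\varepsilon \, \| g - \bar{g}\mathbf{1} \|_2$ (project $g$ onto the hyperplane, then align $\Delta$ with the projection). The function name `localized_width_relaxed` and the parameters `eps`, `n_draws`, and `delta` are hypothetical.

```python
import numpy as np

def localized_width_relaxed(d, eps, n_draws=200, seed=0):
    """Monte Carlo estimate of E sup <g, Delta> over
    {Delta : sum(Delta) = 0, ||Delta||_2 <= eps}, with g ~ N(0, I_d).

    Assumption: only the sum-to-zero constraint and a Euclidean ball are
    imposed. Adding further constraints (for example nonnegativity of the
    resulting policy) can only shrink the supremum, so this is an upper
    bound for any such smaller set.
    """
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_draws, d))
    # Projecting g onto the hyperplane sum(Delta) = 0 subtracts its mean.
    centered = g - g.mean(axis=1, keepdims=True)
    # The maximizer is eps * centered / ||centered||, giving eps * ||centered||.
    sups = eps * np.linalg.norm(centered, axis=1)
    return sups.mean(), sups

d, eps, delta = 10000, 0.1, 0.05
width, sups = localized_width_relaxed(d, eps)
# Empirical (1 - delta) quantile of the supremum across the Monte Carlo draws.
quantile = np.quantile(sups, 1.0 - delta)
print(f"estimated width: {width:.3f}, (1 - delta) quantile: {quantile:.3f}")
```

In this relaxed setting the estimate should sit near `eps * sqrt(d - 1)`, which gives a quick sanity check on the sampler. When the constraint set has no closed-form supremum, each draw would instead require solving a small convex program.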
Figure 3: Policy values and their lower bounds for a data-starved MAB instance with 10000 arms whose reward distribution is given by the paper's reward-distribution equation.

Theorems & Definitions (21)

Theorem 3.1: LCB Performance
Definition 3.2: Stochastic Policies
Definition 4.1: Critical Radius
Definition 4.2: Quantile of the supremum of Gaussian process
Theorem 5.1: Main theorem
Lemma 5.2
Proposition 1.1
proof
Theorem 3.1
Corollary 3.2
...and 11 more
