In-Context Learning for Pure Exploration

Alessio Russo, Ryan Welch, Aldo Pacchiano

TL;DR

The paper addresses active sequential hypothesis testing (pure exploration) by introducing In-Context Pure Exploration (ICPE), a Transformer-based meta-learning framework that jointly learns data-collection policies and inference rules across task families. ICPE trains two Transformers, $I_\phi$ for posterior-based inference and $Q_\theta$ for action selection, and supports both fixed-budget and fixed-confidence settings without requiring explicit likelihood models at test time; inference is performed by a simple forward pass, relying on learned priors over hypotheses. Theoretical results show that the optimal inference is the posterior-maximum and that the learned RL objective aligns with information-rich data collection, with clear fixed-budget and fixed-confidence policy characterizations and stopping criteria that achieve $\delta$-correctness under identifiability assumptions. Empirically, ICPE matches or surpasses principled baselines on stochastic/deterministic bandits and generalized search tasks (e.g., MNIST region sampling, probabilistic binary search), demonstrating robust transfer across non-tabular environments and latent information structures. This work highlights Transformers as practical, structure-aware architectures for sequential testing and meta-learning, enabling efficient hypothesis identification across diverse tasks without hand-crafted models of the information structure.

Abstract

We study the problem of active sequential hypothesis testing, also known as pure exploration: given a new task, the learner adaptively collects data from the environment to efficiently determine an underlying correct hypothesis. A classical instance of this problem is the task of identifying the best arm in a multi-armed bandit problem (a.k.a. BAI, Best-Arm Identification), where actions index hypotheses. Another important case is generalized search, a problem of determining the correct label through a sequence of strategically selected queries that indirectly reveal information about the label. In this work, we introduce In-Context Pure Exploration (ICPE), which meta-trains Transformers to map observation histories to query actions and a predicted hypothesis, yielding a model that transfers in-context. At inference time, ICPE actively gathers evidence on new tasks and infers the true hypothesis without parameter updates. Across deterministic, stochastic, and structured benchmarks, including BAI and generalized search, ICPE is competitive with adaptive baselines while requiring no explicit modeling of information structure. Our results support Transformers as practical architectures for general sequential testing.
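
As a rough illustration of the interaction loop described above (choose a query, observe, update beliefs, stop, and report a hypothesis), the following is a minimal sketch on a toy two-armed Bernoulli bandit. It is not ICPE: the uniform-random policy, the exact Bayesian update with known likelihoods, the posterior-threshold stopping rule, and the names `run_fixed_confidence`, `means`, and `max_steps` are all assumptions made for illustration. ICPE replaces the hand-written policy and inference rule with learned Transformers.

```python
import numpy as np

def run_fixed_confidence(true_h, means, prior, delta, rng, max_steps=10_000):
    """Toy fixed-confidence pure-exploration loop (illustrative only).

    means[h, a] is the Bernoulli mean of action a under hypothesis h
    (assumed known here, and strictly between 0 and 1).
    Returns (stopping time, predicted hypothesis, final posterior).
    """
    n_hyp, n_actions = means.shape
    log_post = np.log(prior)
    post = prior / prior.sum()
    for t in range(1, max_steps + 1):
        a = rng.integers(n_actions)           # placeholder policy: uniform exploration
        x = rng.random() < means[true_h, a]   # Bernoulli observation from the true task
        p = means[:, a]
        log_post = log_post + np.log(p if x else 1.0 - p)  # Bayes update in log space
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        if post.max() >= 1.0 - delta:         # assumed stopping rule: posterior threshold
            return t, int(post.argmax()), post
    return max_steps, int(post.argmax()), post

# Hypothesis h says "arm h is the good arm".
means = np.array([[0.8, 0.2],
                  [0.2, 0.8]])
prior = np.array([0.5, 0.5])
rng = np.random.default_rng(0)
tau, h_hat, post = run_fixed_confidence(true_h=0, means=means, prior=prior,
                                        delta=0.1, rng=rng)
```

Here `tau` plays the role of the stopping time and `h_hat` the inferred hypothesis; a better data-collection policy would aim to reduce `tau` at the same confidence level.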

Paper Structure

This paper contains 100 sections, 27 theorems, 197 equations, 23 figures, 4 tables, 4 algorithms.

Key Result

Proposition 3.1

Fix $t\geq 1$ and a policy $\pi$. The inference rule that attains $\sup_{I_t} \mathbb P^\pi(H^\star=I_t({\cal D}_t))$ is given by $I_t^\star(z)=\mathop{\mathrm{arg\,max}}\limits_{H\in {\cal H}} {\mathbb P}(H^\star=H|{\cal D}_t=z)$.
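
To make the posterior-maximum rule concrete, here is a minimal sketch for a finite hypothesis set with i.i.d. observations and known likelihoods. These are simplifying assumptions: the setting in the paper involves an adaptive policy and a learned inference network, and the function names `posterior` and `map_inference` are hypothetical.

```python
import numpy as np

def posterior(prior, likelihoods, observations):
    """Exact posterior over a finite hypothesis set.

    prior:        shape (n_hyp,), prior probability of each hypothesis
    likelihoods:  shape (n_hyp, n_outcomes), P(x | hypothesis) per observation
    observations: sequence of observed outcome indices (assumed i.i.d.)
    """
    log_post = np.log(prior)
    for x in observations:
        log_post = log_post + np.log(likelihoods[:, x])
    log_post = log_post - log_post.max()   # stabilise before exponentiating
    post = np.exp(log_post)
    return post / post.sum()

def map_inference(prior, likelihoods, observations):
    """Return the hypothesis index with the largest posterior probability."""
    return int(np.argmax(posterior(prior, likelihoods, observations)))

# Three hypotheses over a binary outcome (illustrative numbers).
prior = np.array([0.5, 0.3, 0.2])
likelihoods = np.array([[0.9, 0.1],
                        [0.5, 0.5],
                        [0.1, 0.9]])
data = [1, 1, 0, 1]
post = posterior(prior, likelihoods, data)
h_hat = map_inference(prior, likelihoods, data)   # index 1 for this data
```

For this data the unnormalised posterior weights are 0.00045, 0.01875, and 0.01458, so the rule selects hypothesis 1 even though hypothesis 0 has the largest prior. Any other choice would have a strictly lower probability of being correct given the same data.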

Figures (23)

  • Figure 1: (a) Generalized search example: ICPE starts from a masked image (left) and sequentially reveals patches expected to reduce the posterior entropy over labels. It stops once the inferred label is $\delta$-correct (right). (b) After executing an action $a_t$, the agent observes $x_{t+1}$. At inference time, the collected data is used to infer a hypothesis.
  • Figure 2: Results for stochastic MABs with fixed confidence $\delta=0.1$ and $N=100$: (a) average stopping time $\tau$; (b) survival function of $\tau$; (c) probability of correctness $\mathbb{P}^\pi(\hat{H}_\tau=H^\star)$.
  • Figure 3: Deterministic bandits: (left) probability of correctly identifying the best action vs. $K$; (right) average fraction of unique actions selected during exploration vs. $K$.
  • Figure 4: (a) Single magic action: average stopping time and the theoretical lower bound across varying $\sigma_m$. (b) Magic chain: average stopping time of ICPE and $I$-IDS vs. number of magic actions. (c) ICPE in a regret minimization task, with $\sigma_m=0.1$.
  • Figure 5: MNIST pixel-sampling task: (a) A chord between two digits indicates that their distributions were not significantly different ($p$-value $>0.05$, based on a pairwise chi-squared test), with thicker chords representing higher $p$-values; (b) accuracy and performance (mean $\pm$ 95% CI)
  • ...and 18 more figures

Theorems & Definitions (51)

  • Example 2.1: Best Arm Identification
  • Proposition 3.1: Inference Rule Optimality
  • Theorem 3.2: Policy Optimality for Fixed Budget
  • Theorem 3.3: Policy Optimality for Fixed Confidence
  • Lemma B.1: Posterior kernel
  • proof
  • Proposition B.2
  • proof
  • Proposition B.3
  • proof
  • ...and 41 more