A Classification View on Meta Learning Bandits

Mirco Mutti, Jeongyeol Kwon, Shie Mannor, Aviv Tamar

TL;DR

The paper addresses the challenge of designing fast, interpretable exploration for a finite collection of bandit tasks by recasting meta-learning bandits under a separation condition as a classification problem. It introduces a formal classification-based framework with a classification-coefficient $C_{\lambda}$ and an explicit classify-then-exploit strategy (ECE), plus a practical DT-ECE variant that builds an interpretable decision-tree plan via offline meta-training. The authors prove an instance-dependent regret bound $\mathrm{Reg}_H(\mathbb{M}) = O(\lambda^{-2} C_{\lambda}(\mathbb{M}) \log^2 (MH))$ in the test phase and show near-optimality through lower bounds, while providing a tractable meta-training pipeline with estimation guarantees and a decision-tree classifier. Empirically, DT-ECE matches or rivals latent-bandit baselines on non-contextual tasks, delivering interpretable plans and rapid task identification (often within a few hundred samples) and demonstrating robustness to misspecification. The framework offers a principled path to interpretable, data-efficient exploration that could extend to contextual MDPs and RL, enabling human-friendly, test-time decision making in complex sequential settings.

Abstract

Contextual multi-armed bandits are a popular choice to model sequential decision-making. For example, in a healthcare application we may perform various tests to assess a patient's condition (exploration) and then decide on the best treatment to give (exploitation). When humans design strategies, they aim for the exploration to be fast, since the patient's health is at stake, and easy to interpret for a physician overseeing the process. However, common bandit algorithms are nothing like that: the regret caused by exploration scales with $\sqrt{H}$ over $H$ rounds, and decision strategies are based on opaque statistical considerations. In this paper, we use an original classification view to meta learn interpretable and fast exploration plans for a fixed collection of bandits $\mathbb{M}$. The plan is prescribed by an interpretable decision tree probing decisions' payoff to classify the test bandit. The test regret of the plan in the stochastic and contextual setting scales with $O(\lambda^{-2} C_{\lambda}(\mathbb{M}) \log^2 (MH))$, where $M$ is the size of $\mathbb{M}$, $\lambda$ is a separation parameter over the bandits, and $C_{\lambda}(\mathbb{M})$ is a novel classification-coefficient that fundamentally links meta learning bandits with classification. Through a nearly matching lower bound, we show that $C_{\lambda}(\mathbb{M})$ inherently captures the complexity of the setting.
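To make the classify-then-exploit idea concrete, here is a minimal Python sketch. It is not the authors' ECE or DT-ECE algorithm. It assumes a non-contextual setting with rewards in $[0, 1]$, a known list of mean-reward vectors (one per bandit in the collection), and a simplified separation condition: on any arm, the means of two bandits are either equal or at least `lam` apart. Instead of a meta-trained decision tree, it swaps in a greedy rule that probes the arm whose means are most spread out among the surviving candidates. The function name, its arguments, and the per-probe sample size (from Hoeffding's inequality) are illustrative choices.

```python
import math
import random


def classify_then_exploit(collection, pull, horizon, lam, delta=0.05):
    """Identify the test bandit by elimination, then play its best arm.

    collection: list of mean-reward vectors, one per candidate bandit
                (hypothetical input format, assumed known from meta-training).
    pull: callable arm -> reward in [0, 1] drawn from the unknown test bandit.
    Returns (index of the identified bandit, total reward collected).
    """
    n_arms = len(collection[0])
    candidates = list(range(len(collection)))
    # Hoeffding: n pulls keep the empirical mean within lam/2 of the true mean
    # with probability at least 1 - delta/M for each probe.
    n = math.ceil(2 * math.log(2 * len(collection) / delta) / lam ** 2)
    t, total = 0, 0.0

    while len(candidates) > 1 and t + n <= horizon:
        # Greedy probe (an assumption, not the paper's decision tree):
        # pick the arm whose means differ most among surviving candidates.
        def spread(arm):
            means = [collection[i][arm] for i in candidates]
            return max(means) - min(means)

        arm = max(range(n_arms), key=spread)
        if spread(arm) == 0:
            break  # the remaining candidates are indistinguishable
        rewards = [pull(arm) for _ in range(n)]
        t += n
        total += sum(rewards)
        mean = sum(rewards) / n
        # Keep only the candidates whose mean on this arm matches the estimate.
        survivors = [i for i in candidates
                     if abs(collection[i][arm] - mean) < lam / 2]
        # If an unlucky sample eliminated everyone, keep the closest candidate.
        candidates = survivors or [
            min(candidates, key=lambda i: abs(collection[i][arm] - mean))
        ]

    identified = candidates[0]
    best_arm = max(range(n_arms), key=lambda a: collection[identified][a])
    for _ in range(horizon - t):  # exploitation phase
        total += pull(best_arm)
    return identified, total


if __name__ == "__main__":
    random.seed(0)
    bandits = [[0.9, 0.1, 0.5], [0.1, 0.9, 0.5], [0.5, 0.5, 0.9]]
    true_means = bandits[2]  # the test bandit, unknown to the learner
    bernoulli_pull = lambda arm: 1.0 if random.random() < true_means[arm] else 0.0
    print(classify_then_exploit(bandits, bernoulli_pull, horizon=1000, lam=0.4))
```

In this sketch the exploration cost is the number of probes times roughly $\lambda^{-2} \log(M/\delta)$ pulls, independent of $H$, which is the qualitative behaviour the regret bound describes. How few probes suffice is what the classification-coefficient $C_{\lambda}(\mathbb{M})$ is meant to capture, and the greedy rule above makes no claim to achieve it.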

Paper Structure

This paper contains 30 sections, 16 theorems, 69 equations, 4 figures, 6 algorithms.

Key Result

theorem 2.1

Let $\mathbb{M}$ be a set of $M \geq 2$ bandits and let $\mathcal{X} = \{x\}$ be a singleton. The test regret is $\mathrm{Reg}_H (\mathbb{M}) = \Omega (\sqrt{MH})$.

Figures (4)

  • Figure 1: Left: An excerpt from a clinical flowchart for the medical diagnosis of asthma [gif2023global]. Right: Illustration of an interpretable exploration plan for a MAB.
  • Figure 2: The meta learning bandits problem setting.
  • Figure 3: Visualization of a generic split of $\texttt{tree} (\hat{\mathbb{M}})$.
  • Figure 4: Regret of DT-ECE (ours), mUCB [lazaric2013sequential], and mTS [hong2020latent]. Captions report envname-M-K, denoting the name of the collection of bandits, the size of the collection, and the number of arms, respectively, together with the value of the separation parameter $\lambda$. Curves average 20 independent runs; shaded regions are 95% confidence intervals.

Theorems & Definitions (23)

  • theorem 2.1 [lai1985asymptotically]
  • theorem 3.1
  • lemma 3.2
  • theorem 3.3
  • lemma 4.1
  • theorem 4.2
  • lemma 4.3
  • theorem 4.4
  • lemma 1.1: Ville's Inequality
  • lemma 1.2: Uniform Bound on the Likelihood Ratios
  • ...and 13 more