Conformal Bandits: Bringing statistical validity and reward efficiency to the small-gap regime

Simone Cuonzo, Nina Deliu

TL;DR

This work introduces Conformal Bandits, a framework that fuses Conformal Prediction with multi-armed bandits to achieve finite-sample predictive validity in small-gap settings while preserving regret-minimisation. It replaces classical Hoeffding-based bounds with conformal prediction intervals and extends the framework with risk-aware indices (e.g., Exploratory Skewness Index) and regime-aware adaptations, including Conformal UCB and Conformal Bandits. Through extensive simulations, the authors demonstrate nominal coverage and improved regret in small-gap regimes, even under heavy tails and skewness, and show practical gains in portfolio allocation when incorporating regime-switching via Hidden Markov Models. The paper also discusses limitations related to exchangeability and non-stationarity, and outlines future work including extensions to Thompson Sampling, theoretical regret analyses, and broader domain applications. Overall, Conformal Bandits offer a robust, data-driven approach that couples regret efficiency with principled uncertainty quantification for sequential decision-making in complex, uncertain environments.

Abstract

We introduce Conformal Bandits, a novel framework integrating Conformal Prediction (CP) into bandit problems, a classic paradigm for sequential decision-making under uncertainty. Traditional regret-minimisation bandit strategies like Thompson Sampling and Upper Confidence Bound (UCB) typically rely on distributional assumptions or asymptotic guarantees; further, they remain largely focused on regret, neglecting their statistical properties. We address this gap. Through the adoption of CP, we bridge the regret-minimising potential of a decision-making bandit policy with statistical guarantees in the form of finite-time prediction coverage. We demonstrate the potential of Conformal Bandits through simulation studies and an application to portfolio allocation, a typical small-gap regime, where differences in arm rewards are far too small for classical policies to achieve optimal regret bounds in finite samples. Motivated by this, we showcase our framework's practical advantage in terms of regret in small-gap settings, as well as its added value in achieving nominal coverage guarantees where classical UCB policies fail. Focusing on our application of interest, we further illustrate how integrating hidden Markov models to capture the regime-switching behaviour of financial markets enhances the exploration-exploitation trade-off and translates into higher risk-adjusted regret efficiency returns, while preserving coverage guarantees.
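
The contrast between a Hoeffding-based index and a conformal one can be made concrete with a minimal sketch. It is not the paper's algorithm: the function names `ucb1_index` and `conformal_upper_bound`, the choice of the raw reward as conformity score, and the toy reward values are all assumptions made for illustration. The first function is the standard UCB1 index; the second returns the one-sided conformal upper bound for the next reward, which covers it with probability at least $1 - \alpha$ when past and future rewards are exchangeable.

```python
import math


def ucb1_index(rewards, t):
    """Standard UCB1 index: sample mean plus the Hoeffding bonus sqrt(2 ln t / n)."""
    n = len(rewards)
    return sum(rewards) / n + math.sqrt(2.0 * math.log(t) / n)


def conformal_upper_bound(rewards, alpha=0.1):
    """One-sided conformal upper bound for the next reward of one arm.

    Assumption (illustrative): the conformity score is the reward itself,
    so the bound is the k-th smallest past reward, k = ceil((n + 1)(1 - alpha)).
    If k exceeds n there are too few observations and the bound is infinite.
    """
    n = len(rewards)
    k = math.ceil((n + 1) * (1.0 - alpha))
    if k > n:
        return math.inf
    return sorted(rewards)[k - 1]


# Hypothetical rewards for a single arm, on the scale of the small-gap setting.
rewards = [0.02, -0.05, 0.11, 0.07, -0.01, 0.04, 0.09, -0.08, 0.03, 0.06]
print(ucb1_index(rewards, t=100))
print(conformal_upper_bound(rewards, alpha=0.2))
print(conformal_upper_bound(rewards, alpha=0.05))
```

On these toy values the UCB1 index is close to 0.99, because its bonus is calibrated for rewards in $[0, 1]$ and dwarfs a gap of order 0.05, whereas the conformal bound at $\alpha = 0.2$ is 0.09, a value taken from the observed rewards themselves. With $\alpha = 0.05$ and only ten observations the conformal bound is infinite, which shows how the data-driven bound depends on the sample size. Rewards collected adaptively by a bandit policy need not be exchangeable, so the coverage statement applies only under that assumption.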

Paper Structure

This paper contains 25 sections, 29 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Cumulative regret (left) and best-arm selection (center) attained with UCB1 (Auer et al., 2002) in a three-armed bandit with gap $\Delta_k = \mu^* - \mu_k$, where $\mu_k = 0$ for suboptimal arms $k \neq k^*$, and optimal arm mean $\mu^* = \mu_{k^*} \in \{0.01, 0.05, 0.1\}$. Arm rewards are drawn from a normal distribution $\mathcal{N}(\mu_k, \sigma = 0.1)$; the right plot provides a comparison for $\Delta = \mu^* = 0.05$, the case study in the Simulation Studies section.
  • Figure 2: Algorithm 1: Conformal bandits
  • Figure 3: Comparison among bandit policies in terms of cumulative regret and best-arm selection over time, for all reward scenarios. Reward means reflect a relatively small-gap scenario with $\Delta = 0.05$ ($\mu_1 = \mu^* = 0.05$ and $\mu_2 = \mu_3 = 0$). All results are reported as averages with $95\%$ bounds across the $1,000$ Monte Carlo (MC) replicates.
  • Figure 4: Cumulative wealth of CP-based and UCB-based bandit policies under a partial-information setting, compared with EW, MV and SA portfolio benchmarks. Background shading highlights market regimes inferred via HMM: green denotes Bull phases, gray Neutral markets, and pink Bear episodes. Shaded bands around randomised CP policies indicate $95\%$ confidence intervals computed over $1,000$ MC runs.
  • Figure 5: Comparison between Conformal Bandit variants and classical UCB1 in the big-gap setting, based on $1,000$ Monte Carlo simulations. Rows correspond to different reward-generating environments: Gaussian, Student-$t$, and skew-$t$ with asymmetric tails. Left column reports cumulative regret; right column reports cumulative best-arm selection rates. Shaded regions represent $95\%$ Monte Carlo confidence intervals.
  • ...and 3 more figures
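
The three-armed setup in the Figure 1 caption can be reproduced in outline with the sketch below. It is a hedged illustration, not the authors' code: the function name `run_ucb1`, the horizon, the seed, and the single-run design are assumptions, and the exploration bonus is the textbook UCB1 form $\sqrt{2 \ln t / n_k}$, which may differ from the exact variant used in the paper. Only the arm means ($\mu^*$ for the optimal arm, 0 for the other two) and $\sigma = 0.1$ come from the caption.

```python
import math
import random


def run_ucb1(mu_star, horizon=2000, sigma=0.1, seed=0):
    """Run one UCB1 trajectory on a three-armed Gaussian bandit.

    Returns the cumulative pseudo-regret and the fraction of pulls
    allocated to the optimal arm (arm 0). Horizon and seed are
    illustrative choices, not values taken from the paper.
    """
    rng = random.Random(seed)
    means = [mu_star, 0.0, 0.0]      # optimal arm first, two suboptimal arms
    counts = [0] * len(means)        # number of pulls per arm
    sums = [0.0] * len(means)        # sum of observed rewards per arm
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= len(means):
            arm = t - 1              # initialisation: pull each arm once
        else:
            arm = max(
                range(len(means)),
                key=lambda k: sums[k] / counts[k]
                + math.sqrt(2.0 * math.log(t) / counts[k]),
            )
        reward = rng.gauss(means[arm], sigma)
        counts[arm] += 1
        sums[arm] += reward
        regret += mu_star - means[arm]   # gap of the arm that was pulled
    return regret, counts[0] / horizon


for mu_star in (0.01, 0.05, 0.1):
    regret, best_arm_rate = run_ucb1(mu_star)
    print(mu_star, round(regret, 3), round(best_arm_rate, 3))
```

Averaging many such runs with different seeds would give curves comparable in spirit to the left and center panels; a single trajectory only indicates how slowly the best-arm selection rate moves when the gap is small relative to the exploration bonus.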