Data-Driven Online Model Selection With Regret Guarantees

Aldo Pacchiano, Christoph Dann, Claudio Gentile

TL;DR

This work tackles online model selection among a pool of base learners in stochastic bandit-like environments without relying on predefined candidate regret bounds. It introduces two data-driven regret-balancing meta-algorithms, D$^3$RB and ED$^2$RB, which learn regret coefficients from data and regulate exploration via balancing potentials, yielding regret bounds tied to realized regret rather than worst-case guarantees. The main theoretical result bounds $\mathrm{Reg}(T)$ by $\tilde{O}(d M \sqrt{T} + d^2 \sqrt{M T})$, where $M$ is the number of base learners and $d$ stands for a data-dependent regret coefficient (the smallest monotonic regret coefficient among the base learners, written $\bar{d}^\star_T$ in Theorem 3.1). Empirical tests show improvements over baselines such as Corral and RB Grid across multiple settings. By exploiting base-learner variability and avoiding fixed candidate bounds, the approach offers tighter, environment-adaptive guarantees and better practical performance for online model selection in sequential decision problems.

Abstract

We consider model selection for sequential decision making in stochastic environments with bandit feedback, where a meta-learner has at its disposal a pool of base learners, and decides on the fly which action to take based on the policies recommended by each base learner. Model selection is performed by regret balancing but, unlike the recent literature on this subject, we do not assume any prior knowledge about the base learners like candidate regret guarantees; instead, we uncover these quantities in a data-driven manner. The meta-learner is therefore able to leverage the realized regret incurred by each base learner for the learning environment at hand (as opposed to the expected regret), and single out the best such regret. We design two model selection algorithms operating with this more ambitious notion of regret and, besides proving model selection guarantees via regret balancing, we experimentally demonstrate the compelling practical benefits of dealing with actual regrets instead of candidate regret bounds.

Paper Structure (26 sections, 15 theorems, 50 equations, 5 figures, 4 tables, 3 algorithms)

Key Result

Theorem 3.1

With probability at least $1 - \delta$, the regret of D$^3$RB (alg:balancing, left) with parameters $\delta$ and $d_{\min} \geq 1$ is bounded in all rounds $T \in \mathbb{N}$ as $\mathrm{Reg}(T) = \tilde{O}\left(\bar{d}^\star_T M \sqrt{T} + (\bar{d}^\star_T)^2 \sqrt{M T}\right)$ (here and throughout, $\tilde{O}$ hides log-factors), where $\bar{d}^\star_T = \min_{i \in [M]} \bar{d}^i_T = \min_{i \in [M]} \max_{t \in [T]} d^i_t$ is the smallest monotonic regret coefficient among all learners (see def:r
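
As a rough illustration of the quantities in Theorem 3.1, the sketch below computes the monotonic regret coefficients $\bar{d}^i_T = \max_{t \in [T]} d^i_t$ as running maxima and takes their minimum over learners to obtain $\bar{d}^\star_T$. The coefficient trajectories are made-up numbers and the function names are hypothetical, not taken from the paper. The helper `bound_order` only evaluates the expression $d M \sqrt{T} + d^2 \sqrt{M T}$ from the TL;DR, with constants and log-factors dropped, so it indicates scaling rather than the actual bound.

```python
import math
from itertools import accumulate


def monotonic_coefficients(coeffs):
    """Running maximum of one learner's coefficients: max over t <= T of d_t."""
    return list(accumulate(coeffs, max))


def best_monotonic_coefficient(coeffs_per_learner):
    """Smallest monotonic coefficient over learners: min_i max_t d^i_t."""
    return min(max(c) for c in coeffs_per_learner)


def bound_order(d, num_learners, horizon):
    """Evaluate d*M*sqrt(T) + d^2*sqrt(M*T); constants and log-factors omitted."""
    return (d * num_learners * math.sqrt(horizon)
            + d ** 2 * math.sqrt(num_learners * horizon))


# Hypothetical coefficient trajectories for M = 2 base learners (all >= d_min = 1).
coeffs = [
    [1.0, 3.0, 2.0, 2.5],
    [1.5, 1.2, 2.0, 1.8],
]

mono = [monotonic_coefficients(c) for c in coeffs]
d_star = best_monotonic_coefficient(coeffs)

print(mono)    # per-learner running maxima
print(d_star)  # smallest final running maximum across learners
print(bound_order(d_star, num_learners=2, horizon=10_000))
```

In this toy input the first learner's coefficient spikes early, so its monotonic coefficient stays high even after the raw coefficient drops; the minimum over learners is then determined by the second learner.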

Figures (5)

  • Figure 1: Left: Expected regret of two base learners (UCB on MAB with confidence scaling $c$ controlling explore-exploit trade-off) and a model selection algorithm on top of them. The model selection algorithm has smaller expected regret than any base learner. Right: Expected regret and individual regret realizations (independent sample runs) of base learners. The base learners have highly variable performance which model selection can capitalize on. Detailed setup in app:earlyfigdetails.
  • Figure 2: Illustration of def:regretcoeff for one of the baseline realizations from fig:expected_regret_motivation. Left: Evolution of regret scale, coefficient and monotonic coefficient. Right: The same curves multiplied by $\sqrt{k}$. The induced regret bounds from regret coefficients follow the realized regret closely, the non-monotonic version more closely than the monotonic.
  • Figure 3: Average performance comparing all meta-learners (see tab:general_overview for reference). Experiment $1$: Self model selection. See also fig:expected_regret_sample_runs_appendix in app:detailexperiments, containing regret curves for D$^3$RB and ED$^2$RB on a single realization. Experiment $2$: base learners (UCB) with different confidence multipliers $c$. Experiments $3$ and $4$: Dimensionality $d = 10$. Experiments $5$ and $6$: True dimensionality $d^{i_\star} = 5$ and maximal dimensionality $d_M = 15$. In Experiments $3$ and $5$ the action set is the unit sphere. In Experiments $4$ and $6$ the contexts $x_t$ are $10$ actions sampled uniformly from the unit sphere.
  • Figure 4: Experiment Map.
  • Figure 5: Experiment $1$ (see tab:general_overview for reference). Regret for D$^3$RB and ED$^2$RB (alg:balancing) on a single realization.

Theorems & Definitions (17)

  • Definition 2.1: regret scale and coefficients
  • Theorem 3.1
  • Theorem 3.2
  • Definition B.1
  • Lemma B.2
  • Lemma B.3: Balancing potential lemma
  • Lemma C.1
  • Corollary C.2
  • Lemma C.3
  • Lemma C.4
  • ...and 7 more