Information maximization for a broad variety of multi-armed bandit games

Alex Barbier-Chebbah, Christian L. Vestergaard, Jean-Baptiste Masson

TL;DR

This work extends the information-maximization paradigm to three structured multi-armed bandit problems (Explore-$m$, linear, and many-armed bandits), addressing over-exploration through problem-tailored observables and tractable approximations. It introduces PacAIM for identifying an $\epsilon$-optimal top-$m$ subset via a separator $\theta_b$ and a stopping rule based on $P_{\rm top}P_{\rm bot}$, LinAIM, which adapts AIM to linear payoffs by weighting information gains with suboptimality likelihood, and ARM, a finite-horizon strategy that minimizes upcoming regret by balancing exploration of new arms against exploitation of the current best. The methods rely on entropy-based gains, Gaussian/posterior approximations, and extreme-value analyses to yield tractable, implementable policies with competitive empirical performance against standard baselines. The results indicate robust gains across Gaussian and Bernoulli reward settings and provide a unified framework to extend information-based decision-making to broader structured bandit problems, with future work aimed at theoretical performance guarantees and further extensions to non-Gaussian and heavier-tailed rewards.

Abstract

Information and free-energy maximization are physics principles that provide general rules for an agent to optimize actions in line with specific goals and policies. These principles are the building blocks for designing decision-making policies capable of efficient performance with only partial information. Notably, the information maximization principle has shown remarkable success in the classical bandit problem and has recently been shown to yield optimal algorithms for Gaussian and sub-Gaussian reward distributions. This article explores a broad extension of physics-based approaches to more complex and structured bandit problems. To this end, we cover three distinct types of bandit problems, where information maximization is adapted and leads to strong performance. Since the main challenge of information maximization lies in avoiding over-exploration, we highlight how information is tailored at various levels to mitigate this issue, paving the way for more efficient and robust decision-making strategies.

Paper Structure

This paper contains 36 sections, 73 equations, 7 figures, 3 algorithms.

Figures (7)

  • Figure 1: Illustrations of multi-armed bandit settings addressed in this work. a) Illustration of the multi-armed bandit principle. At each time step $t$, the agent chooses an action $i=A_t$ that returns a reward $x_t$ drawn from a distribution of unknown mean $\mu_{i}$. The arm is denoted as $A_t$ when it is a vector and as $a_{t}$ when it is an index. By accumulating rewards, the agent also gathers information about the average rewards of the arms and optimizes its decision-making policy. The specific goals of the agent vary depending on the bandit problem. b) General principle of the Explore-$m$ problem. Here, the goal is to identify the $m$ arms with the highest mean rewards (red arms). To this end, our approach introduces a well-designed separator $\theta_{b}$ between the current $m$ highest arms (forming the set $\mathcal{M}_t$) and the remaining arms at each time. Our algorithm chooses the arm that maximizes the information gain on how effectively the separator distinguishes the two subsets (cumulative distributions pictured in yellow and red below). c) Illustration of the linear bandit setting. Here, arms are $d$-dimensional vectors resampled at each time step. The scalar product between the arm vector and a constant but unknown vector $\theta_*$ gives the mean reward value. Because of this geometric dependence of the reward, pulling an arm also provides information on the expected results of the others. d) Illustration of the many-armed setting. Here, the agent has a finite duration, known in advance, to maximize its gains. Because there is no time to explore all the arms thoroughly, the agent needs to focus on a small subset of arms to isolate a sufficiently promising solution before time runs out. Our approach balances two possible actions: exploring a new arm (represented in blue) or greedily exploiting the current empirical best solution (represented in green) based on the empirical mean $\mu_{M_{t}}$. We model the upcoming losses as a function to guide the decision-making process.
  • Figure 2: Stopping time of Explore-$m$ algorithms for Gaussian rewards. The stopping time, i.e., the time at which each algorithm's stopping criterion is met, is measured for different numbers of arms $n$. Our algorithm is shown in blue and LUCB1 (kalyanakrishnanPACSubsetSelection2012a) in red. The $\epsilon$-optimal subset size satisfies $m = n/5$. (a) Mean reward values are sampled from two disjoint ensembles, $[0,0.8]$ and $[0.9,1]$ for the $\epsilon$-optimal subset, with confidence parameters $\epsilon=0.05$, $\delta=0.1$. (b) Suboptimal mean reward values are all set to $0.7$ and $\epsilon$-optimal ones to $0.8$, with confidence parameters $\epsilon=0.05$, $\delta=0.02$. The time is averaged over $100$ runs, with the standard deviation indicated. See the Supplementary Material for numerical details and success probability rates.
  • Figure 3: Mean regret for two linear bandit settings with Gaussian rewards. Our algorithm is shown in blue; UCB algorithms adapted to the linear setting are shown in orange and red. (a) Bandit setting with $10$ arms resampled at each time from a normal distribution in $\mathbb{R}^{10}$. (b) Toy problem borrowed from tirinzoniAsymptoticallyOptimalPrimaldual2020 with two contexts. Each context (detailed in the main text) is drawn with equal probability and $\xi=0.5$. Error bars show the standard error around the mean. Details of the algorithms and simulations, as well as a focus on standard deviations, are provided in the Supplementary Material.
  • Figure 4: Mean regret for the many-armed bandit problem with Bernoulli rewards and a uniform prior on $[0,1]$. (a) The regret growth of our algorithm (ARM), in blue, is observed until the horizon, i.e., the stopping time, $T=10000$ is reached ($K=5000 \gg \sqrt{T}$ and $c=1$). The regret is averaged over $2000$ runs and standard errors are indicated (see the Supplementary Material for details on the numerical settings). Its performance is compared to the ss-greedy algorithm, in red. The final slopes of the two algorithms differ, indicating that the arm ss-greedy selects for most of its exploitation is less efficient than the one found by our algorithm. Additionally, our algorithm presents a shallower initial slope because exploration and exploitation are already mixed. (b) Mean regret at the stopping time (when the horizon is reached) for distinct games with varying horizon length ($K=T \gg \sqrt{T}$ and $c=1$). The regret is averaged over $2000$ runs, with error bars indicating the standard error on the mean. Details of the algorithms and simulations, as well as a focus on standard deviations, are provided in the Supplementary Material.
  • Figure S1: Success probability of the Explore-$m$ experiments presented in the main text. The probability of identifying the $\epsilon$-optimal subset is averaged over $100$ runs for different numbers of arms $n$. Results are shown for our algorithm (blue) and LUCB1 kalyanakrishnanPACSubsetSelection2012a (red). The $\epsilon$-optimal subset size satisfies $m = n/5$. (a) Mean reward values are sampled from two disjoint ensembles $[0, 0.7]$ and $[0.8, 1]$, with confidence parameters $\epsilon=0.05$ and $\delta=0.1$. (b) Suboptimal mean reward values are fixed at $0.7$, while $\epsilon$-optimal rewards are set to $0.8$, with confidence parameters $\epsilon=0.05$ and $\delta=0.02$. Both experiments demonstrate an empirical success probability exceeding the tolerance threshold $1-\delta$, validating the stopping time measured in the main text.
  • ...and 2 more figures
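
The linear bandit setting described in Figure 1c and Figure 3a (arms resampled at each step from a normal distribution, mean reward given by the scalar product with an unknown vector $\theta_*$) can be made concrete with a small simulation. The sketch below is only illustrative: the policy is a plain greedy ridge-regression baseline, not LinAIM or the UCB variants from the paper, and the function name `simulate_linear_bandit`, the unit-norm choice for $\theta_*$, the noise level `noise_std`, and the regularization `ridge` are all assumptions introduced here.

```python
import numpy as np


def simulate_linear_bandit(horizon=500, n_arms=10, dim=10,
                           noise_std=1.0, ridge=1.0, seed=0):
    """Simulate a linear bandit and return the cumulative regret per step.

    The policy is a greedy ridge-regression placeholder (an assumption),
    NOT the LinAIM algorithm of the paper.
    """
    rng = np.random.default_rng(seed)

    # Unknown parameter vector; the unit norm is an illustrative assumption.
    theta_star = rng.normal(size=dim)
    theta_star /= np.linalg.norm(theta_star)

    gram = ridge * np.eye(dim)   # regularized sum of a a^T over pulled arms
    moment = np.zeros(dim)       # sum of reward * a over pulled arms
    regret = np.zeros(horizon)
    total = 0.0

    for t in range(horizon):
        # Arms are resampled at every step from a standard normal in R^dim.
        arms = rng.normal(size=(n_arms, dim))
        means = arms @ theta_star            # mean reward = scalar product

        # Ridge estimate of theta_star from past pulls, then greedy choice.
        theta_hat = np.linalg.solve(gram, moment)
        choice = int(np.argmax(arms @ theta_hat))

        # Gaussian reward around the mean of the chosen arm.
        reward = means[choice] + noise_std * rng.normal()

        # Pulling one arm updates the estimate used for every other arm.
        gram += np.outer(arms[choice], arms[choice])
        moment += reward * arms[choice]

        # Instantaneous regret: best available mean minus chosen mean.
        total += means.max() - means[choice]
        regret[t] = total

    return regret


if __name__ == "__main__":
    curve = simulate_linear_bandit()
    print(f"cumulative regret after {curve.size} steps: {curve[-1]:.2f}")
```

Because every pull updates the shared estimate of $\theta_*$, a single observation changes the predicted reward of all arms, which is the geometric coupling the caption refers to. Swapping the greedy line for another selection rule is enough to compare policies on the same environment.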