Approximate information maximization for bandit games

Alex Barbier-Chebbah; Christian L. Vestergaard; Jean-Baptiste Masson; Etienne Boursier

TL;DR

An approximate, physics-based analytical representation of an entropy is used to forecast the information gain of each action and to greedily choose the action with the largest gain; the resulting algorithm is proven asymptotically optimal for the two-armed bandit problem with Gaussian rewards.

Abstract

Entropy maximization and free energy minimization are general physical principles for modeling the dynamics of various physical systems. Notable examples include modeling decision-making within the brain using the free-energy principle, optimizing the accuracy-complexity trade-off when accessing hidden variables with the information bottleneck principle (Tishby et al., 2000), and navigation in random environments using information maximization (Vergassola et al., 2007). Building on these principles, we propose a new class of bandit algorithms that maximize an approximation to the information of a key variable within the system. To this end, we develop an approximate, physics-based analytical representation of an entropy to forecast the information gain of each action and greedily choose the action with the largest information gain. This method yields strong performance in classical bandit settings. Motivated by its empirical success, we prove its asymptotic optimality for the two-armed bandit problem with Gaussian rewards. Owing to its ability to encompass the system's properties in a global physical functional, this approach can be efficiently adapted to more complex bandit settings, calling for further investigation of information maximization approaches for multi-armed bandit problems.
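The greedy principle described in the abstract can be illustrated numerically. The sketch below is not the analytical approximation developed in the paper: it assumes that the "key variable" is the maximum mean reward (the quantity whose posterior appears in Figure 1), assumes Gaussian posteriors under a flat prior, and evaluates the entropy by brute-force integration on a grid. All function names (`max_posterior`, `max_entropy`, `information_gains`, `choose_arm`) are hypothetical and introduced only for illustration.

```python
import math
import numpy as np

_erf = np.vectorize(math.erf)  # elementwise error function without SciPy


def normal_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def normal_cdf(x, mean, std):
    return 0.5 * (1.0 + _erf((x - mean) / (std * math.sqrt(2.0))))


def max_posterior(x, means, stds):
    """Density of max_k mu_k for independent Gaussian posteriors (assumed form)."""
    density = np.zeros_like(x)
    for k in range(len(means)):
        term = normal_pdf(x, means[k], stds[k])
        for j in range(len(means)):
            if j != k:
                term = term * normal_cdf(x, means[j], stds[j])
        density += term
    return density


def max_entropy(means, stds, n_grid=4001):
    """Differential entropy of the maximum, by a Riemann sum on a uniform grid."""
    lo = min(m - 8.0 * s for m, s in zip(means, stds))
    hi = max(m + 8.0 * s for m, s in zip(means, stds))
    x = np.linspace(lo, hi, n_grid)
    p = max_posterior(x, means, stds)
    return float(-np.sum(p * np.log(p + 1e-300)) * (x[1] - x[0]))


def expected_entropy_after_pull(k, means, counts, sigma, n_nodes=20):
    """Average the post-update entropy over the predictive reward of arm k."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    nodes, weights = np.polynomial.hermite.hermgauss(n_nodes)
    stds = sigma / np.sqrt(counts)
    pred_std = sigma * math.sqrt(1.0 + 1.0 / counts[k])  # flat-prior predictive std
    total = 0.0
    for node, weight in zip(nodes, weights):
        reward = means[k] + math.sqrt(2.0) * pred_std * node
        new_means, new_stds = means.copy(), stds.copy()
        new_means[k] = (counts[k] * means[k] + reward) / (counts[k] + 1.0)
        new_stds[k] = sigma / math.sqrt(counts[k] + 1.0)
        total += weight * max_entropy(new_means, new_stds)
    return total / math.sqrt(math.pi)


def information_gains(means, counts, sigma=1.0):
    """Expected entropy reduction of the maximum for each candidate arm."""
    means = np.asarray(means, dtype=float)
    counts = np.asarray(counts, dtype=float)
    current = max_entropy(means, sigma / np.sqrt(counts))
    return np.array([
        current - expected_entropy_after_pull(k, means, counts, sigma)
        for k in range(len(means))
    ])


def choose_arm(means, counts, sigma=1.0):
    """Greedy step: pull the arm with the largest forecast information gain."""
    return int(np.argmax(information_gains(means, counts, sigma)))
```

For example, `choose_arm([0.0, 0.0], [1, 50])` compares an arm observed once with an arm observed fifty times. The grid integration is costly compared with a closed-form expression, which is the motivation the abstract gives for an analytical approximation.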

Paper Structure (48 sections, 7 theorems, 129 equations, 8 figures, 4 algorithms)

Introduction
Contributions.
Organization.
Setting
Information maximization strategies
Algorithm design principle: physical intuition
Main elements of the entropy analytical approximation
Approximate information maximization algorithm
Regret bound
Sketch of the proof.
Experiments
Extensions
Exponential family bandits.
Other bandit settings.
Conclusion
...and 33 more sections

Key Result

Theorem 1

For Gaussian reward distributions with variance $\sigma^2$, the regret of AIM satisfies for any mean vector $\pmb{\mu}\in \mathbb{R}^K$ where $\mu^* = \max_{k\in[K]}\mu_k$.
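As background for the asymptotic optimality claim, and not as a restatement of Theorem 1, the classical Lai-Robbins lower bound is the usual benchmark. The sketch below assumes Gaussian arms with common variance $\sigma^2$, writes $R(T)$ for the expected regret at horizon $T$ (a symbol introduced here), and applies to any uniformly efficient policy:

```latex
\liminf_{T\to\infty} \frac{R(T)}{\log T}
  \;\ge\; \sum_{k:\,\mu_k<\mu^*} \frac{\mu^*-\mu_k}{\mathrm{KL}\!\left(\mathcal{N}(\mu_k,\sigma^2)\,\|\,\mathcal{N}(\mu^*,\sigma^2)\right)}
  \;=\; \sum_{k:\,\mu_k<\mu^*} \frac{2\sigma^2}{\mu^*-\mu_k},
\qquad
\mathrm{KL}\!\left(\mathcal{N}(\mu_k,\sigma^2)\,\|\,\mathcal{N}(\mu^*,\sigma^2)\right) = \frac{(\mu^*-\mu_k)^2}{2\sigma^2}.
```

An algorithm is conventionally called asymptotically optimal when its regret matches this constant in front of $\log T$.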

Figures (8)

Figure 1: (a) Posterior distributions of a two-armed bandit with Gaussian rewards. The dotted lines represent the individual posterior distributions of each arm, $p_{M_t}$ and $p_{m}$, while the continuous line represents the posterior of the maximum mean reward of all arms, $p_{\max}$ (\ref{pmaxgeneralexpression}). (b) Zoom of (a) around the point $\bar{\mu}_{\mathrm{eq},m}$ where both arms have the same posterior probability of being the best one. $p_{M_t}C_{m}$ ($p_{m}C_{M_t}$) is the probability that the maximal value is given by the better (worse) empirical arm, and $\tilde{\mu}_{\mathrm{eq},m}$ is the approximation to $\bar{\mu}_{\mathrm{eq},m}$ given in \ref{SifinalformdeltaSi}.
Figure 2: Evolution of the Bayesian regret for (a) 2-armed and (b) 50-armed bandit with Gaussian rewards under a uniform mean prior. The regret is averaged over $8000$ runs for (a) and $2000$ runs for (b). Confidence intervals show the standard deviation.
Figure 3: Evolution of the Bayesian regret for (a) 2-armed and (b) 50-armed bandit with Bernoulli rewards under a uniform mean prior. The regret is averaged over $16000$ runs for (a) and $2000$ runs for (b). Confidence intervals show the standard deviation.
Figure 4: Temporal evolution of the regret for 2-armed bandit with Gaussian rewards ($\sigma=1$) for close mean parameters. AIM is shown in blue and Thompson sampling in red. Arm mean reward values are fixed at $\mu_1 = 0.8$ and $\mu_2 = 0.79$; the regret is obtained by averaging over $10^5$ realizations.
Figure 5: Temporal evolution of the regret for 2-armed bandit with Bernoulli rewards for close mean parameters. AIM is shown in blue and Thompson sampling in red. Arm mean reward values are fixed at $\mu_1 = 0.8$ and $\mu_2 = 0.79$; the regret is obtained by averaging over $10^5$ realizations. Confidence intervals show the standard deviation.
...and 3 more figures
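The Thompson sampling baseline that appears in Figures 4 and 5 can be sketched as follows for the Gaussian case. This is a minimal illustration, not the authors' experimental code: it assumes a flat prior on each mean, known $\sigma$, one initial pull per arm, and pseudo-regret computed from the true gaps. The function name `thompson_regret` and its parameters are hypothetical, and the sketch does not reproduce the AIM curve.

```python
import numpy as np


def thompson_regret(mu, horizon, n_runs, sigma=1.0, seed=0):
    """Average cumulative pseudo-regret of Gaussian Thompson sampling."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    n_arms = mu.size
    gaps = mu.max() - mu  # per-pull regret of each arm
    regret = np.zeros(horizon)
    for _ in range(n_runs):
        counts = np.zeros(n_arms)
        sums = np.zeros(n_arms)
        run_regret = 0.0
        for t in range(horizon):
            if t < n_arms:
                arm = t  # pull every arm once so each posterior is proper
            else:
                # Flat-prior posterior of each mean: N(empirical mean, sigma^2 / n_k)
                samples = rng.normal(sums / counts, sigma / np.sqrt(counts))
                arm = int(np.argmax(samples))
            reward = rng.normal(mu[arm], sigma)
            counts[arm] += 1
            sums[arm] += reward
            run_regret += gaps[arm]
            regret[t] += run_regret
    return regret / n_runs


# Setting of Figure 4, with far fewer runs and a shorter horizon than in the paper
curve = thompson_regret([0.8, 0.79], horizon=200, n_runs=20)
```

With means this close, the gap is only $0.01$ per suboptimal pull, so many more realizations than used here are needed before the averaged curve becomes smooth.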

Theorems & Definitions (13)

Theorem 1
Theorem 2
proof
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
...and 3 more
