Evolution of Information in Interactive Decision Making: A Case Study for Multi-Armed Bandits
Yuzhou Gu, Yanjun Han, Jian Qian
TL;DR
This work analyzes a toy stochastic multi-armed bandit (MAB) problem, an instance of interactive decision making, in which a unique best arm beats every other arm by a fixed gap $\Delta$. It characterizes optimal learning performance through the optimal success probability $p_t^\star$ and the optimal mutual information $I_t^\star$. It identifies four information-growth regimes governed by the effective number of pulls $t\Delta^2$ (an early linear phase, a quadratic phase, a stabilized phase, and a saturated phase) and shows that optimal learning can occur without maximizing information gain. A simple algorithm based on the sequential probability ratio test (SPRT) attains the stated achievability bounds, while novel converse techniques based on change-of-divergence and reduction arguments establish tight upper bounds and show that Fano-type relations are surprisingly loose in the intermediate regimes. The results demonstrate a fundamental separation between learning and information accumulation in interactive decision making. The work also discusses the distinction between stopping-time and fixed-budget settings.
Abstract
We study the evolution of information in interactive decision making through the lens of a stochastic multi-armed bandit problem. Focusing on a fundamental example where a unique optimal arm outperforms the rest by a fixed margin, we characterize the optimal success probability and mutual information over time. Our findings reveal distinct growth phases in mutual information -- initially linear, transitioning to quadratic, and finally returning to linear -- highlighting curious behavioral differences between interactive and non-interactive environments. In particular, we show that optimal success probability and mutual information can be decoupled, where achieving optimal learning does not necessarily require maximizing information gain. These findings shed new light on the intricate interplay between information and learning in interactive decision making.
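To make the setting concrete, the following is a minimal illustrative sketch of an SPRT-style search for the best arm. It is not the authors' algorithm. It assumes unit-variance Gaussian rewards, with mean $\Delta$ for the best arm and mean 0 for every other arm, and it tests one arm at a time. The function name `sprt_find_best_arm`, the `threshold` parameter, and the round-robin scan order are all hypothetical choices made for illustration.

```python
import random


def sprt_find_best_arm(means, delta, budget, threshold=6.0, seed=0):
    """Scan arms one at a time, running a Gaussian SPRT on each.

    Assumed model (not from the source): rewards are N(mean, 1), the best
    arm has mean `delta`, and every other arm has mean 0.
    Returns (guessed_arm, pulls_used).
    """
    rng = random.Random(seed)
    pulls = 0
    arm = 0
    while pulls < budget:
        llr = 0.0  # log-likelihood ratio of N(delta, 1) against N(0, 1)
        while pulls < budget and -threshold < llr < threshold:
            reward = rng.gauss(means[arm], 1.0)
            pulls += 1
            # Per-sample log-likelihood ratio for the two Gaussian hypotheses.
            llr += delta * reward - delta ** 2 / 2.0
        if llr >= threshold:
            return arm, pulls  # evidence favors "this arm is the best"
        if llr <= -threshold:
            arm = (arm + 1) % len(means)  # reject this arm, try the next
    # Budget exhausted: fall back to the arm currently under test.
    return arm, pulls


if __name__ == "__main__":
    means = [0.0, 0.0, 1.0, 0.0]  # arm 2 is best, with gap delta = 1.0
    guess, used = sprt_find_best_arm(means, delta=1.0, budget=10_000)
    print(f"guessed arm {guess} after {used} pulls")
```

Each null arm is rejected after roughly `threshold / (delta**2 / 2)` pulls on average, so the number of pulls spent per arm scales like $1/\Delta^2$. This is one way to see why $t\Delta^2$ acts as the natural time scale in this problem.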
