Disentangling Exploration from Exploitation
Alessandro Lizzeri, Eran Shmaya, Leeat Yariv
TL;DR
This paper studies how to disentangle exploration from exploitation in a two-project Poisson bandit framework, in which learning about potential actions occurs via Poisson signals with rates $\lambda_z^g$ and $\lambda_z^b$, and payoffs depend on the success of each project. When one project is safe, the optimal policy is a threshold rule with critical cutoff $\bar{p}(\alpha)=\dfrac{(r+\lambda(1-\alpha))R_L}{(r+\lambda)R_H-\lambda \alpha R_L}$, and disentanglement ($\alpha<1$) yields higher payoffs, especially at intermediate values of the discount rate $r$ and the news arrival rate $\lambda$. With two risky projects, no Gittins-like index governs exploration: optimal exploration depends on a project-specific information value and exhibits persistence and limited switching. The results differ qualitatively from those for classical entangled bandits, and the disentangled framework offers meaningful payoff gains in intermediate-parameter regimes and across news structures (good, bad, balanced). The framework is relevant to policy evaluation, portfolio exploration, and dynamic information acquisition in settings where exploration targets are not immediately pursued.
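To make the cutoff formula in the summary concrete, the following is a minimal sketch that simply evaluates $\bar{p}(\alpha)$ numerically. The function name `cutoff_belief` and the parameter values are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch assumes the formula is meaningful only when its denominator is positive.

```python
def cutoff_belief(alpha: float, r: float, lam: float, R_L: float, R_H: float) -> float:
    """Evaluate the cutoff formula p_bar(alpha) quoted in the summary.

    alpha, r, lam, R_L, R_H correspond to the symbols in the formula;
    the function name and argument names are illustrative assumptions.
    """
    numerator = (r + lam * (1.0 - alpha)) * R_L
    denominator = (r + lam) * R_H - lam * alpha * R_L
    if denominator <= 0:
        # Assumption: the cutoff is only well defined for a positive denominator.
        raise ValueError("denominator must be positive for the cutoff to be defined")
    return numerator / denominator


if __name__ == "__main__":
    # Illustrative parameter values only; not taken from the paper.
    r, lam, R_L, R_H = 0.1, 1.0, 1.0, 2.0
    for alpha in (0.0, 0.5, 1.0):
        value = cutoff_belief(alpha, r, lam, R_L, R_H)
        print(f"alpha={alpha:.1f}  cutoff={value:.4f}")
```

Two limiting cases follow directly from the algebra: at $\alpha=0$ the expression reduces to $R_L/R_H$, and at $\alpha=1$ it reduces to $rR_L/((r+\lambda)R_H-\lambda R_L)$.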
Abstract
Starting from Robbins (1952), the literature on experimentation via multi-armed bandits has wed exploration and exploitation. Nonetheless, in many applications, agents' exploration and exploitation need not be intertwined: a policymaker may assess new policies different from the status quo; an investor may evaluate projects outside her portfolio. We characterize the optimal experimentation policy when exploration and exploitation are disentangled in the case of Poisson bandits, allowing for general news structures. The optimal policy features complete learning asymptotically and exhibits considerable persistence, but cannot be identified by an index à la Gittins. Disentanglement is particularly valuable for intermediate parameter values.
