Disentangling Exploration from Exploitation

Alessandro Lizzeri, Eran Shmaya, Leeat Yariv

TL;DR

This paper studies how to disentangle exploration from exploitation in a two-project Poisson-bandit framework, in which the agent learns about projects through Poisson signals arriving at rates $\lambda_z^g$ and $\lambda_z^b$, and payoffs depend on the success of each project. When one project is safe, the optimal policy is a threshold rule with cutoff $\bar{p}(\alpha)=\dfrac{(r+\lambda(1-\alpha))R_L}{(r+\lambda)R_H-\lambda \alpha R_L}$, and disentanglement ($\alpha<1$) yields higher payoffs, especially at intermediate values of the discount rate $r$ and the news arrival rate $\lambda$. With two risky projects, no Gittins-like index governs exploration; optimal exploration depends on a project-specific information value and exhibits persistence and limited switching. The results differ qualitatively from those for classical entangled bandits: the disentangled framework delivers meaningful payoff gains in intermediate-parameter regimes and across news structures (good, bad, balanced). The framework is relevant to policy evaluation, portfolio exploration, and dynamic information acquisition in settings where the options being explored are not immediately pursued.
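As a rough numerical illustration of the cutoff formula above, the following Python sketch evaluates $\bar{p}(\alpha)$ for a few values of $\alpha$. The function name `cutoff` and the discount rate `r = 1.0` are assumptions made for this example only. The values $R_L=10$, $R_H=15$ are taken from the Figure 1 caption, and using that caption's $\lambda_H=5$ as the arrival rate $\lambda$ is likewise an assumption.

```python
def cutoff(alpha, r, lam, R_L, R_H):
    """Evaluate the TL;DR cutoff p-bar(alpha).

    p-bar(alpha) = (r + lam*(1 - alpha)) * R_L / ((r + lam)*R_H - lam*alpha*R_L)
    """
    numerator = (r + lam * (1.0 - alpha)) * R_L
    denominator = (r + lam) * R_H - lam * alpha * R_L
    return numerator / denominator


# Payoffs from the Figure 1 caption; treating lambda_H = 5 as lambda is an assumption.
R_L, R_H, LAM = 10.0, 15.0, 5.0
R_DISCOUNT = 1.0  # hypothetical discount rate, not taken from the paper

# Evaluate the cutoff at alpha = 0, an intermediate value, and alpha = 1.
cutoffs = {a: cutoff(a, R_DISCOUNT, LAM, R_L, R_H) for a in (0.0, 0.5, 1.0)}

for a, p in cutoffs.items():
    print(f"alpha={a:.1f}  cutoff={p:.4f}")
```

Two special cases follow directly from the formula: at $\alpha=0$ the cutoff reduces to $R_L/R_H$, and at $\alpha=1$ it reduces to $rR_L/\big((r+\lambda)R_H-\lambda R_L\big)$. Since $R_L<R_H$ here, the expression is decreasing in $\alpha$ for these parameter values.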

Abstract

Starting from Robbins (1952), the literature on experimentation via multi-armed bandits has wed exploration and exploitation. Nonetheless, in many applications, agents' exploration and exploitation need not be intertwined: a policymaker may assess new policies different than the status quo; an investor may evaluate projects outside her portfolio. We characterize the optimal experimentation policy when exploration and exploitation are disentangled in the case of Poisson bandits, allowing for general news structures. The optimal policy features complete learning asymptotically, exhibits lots of persistence, but cannot be identified by an index a la Gittins. Disentanglement is particularly valuable for intermediate parameter values.

Paper Structure (16 sections, 7 theorems, 31 equations, 3 figures)

Key Result

Proposition 0

For all $\alpha<1$, the agent exploits the best project asymptotically.

Figures (3)

  • Figure 1: Payoff value of disentanglement for (a) pure good news settings, and (b) pure bad news settings when $R_{L}=10$, $R_{H}=15$, and $\lambda_{H}=5$
  • Figure 2: Optimal policy with two risky projects in good news settings
  • Figure 3: Optimal policy with two risky projects in bad news settings

Theorems & Definitions (17)

  • Proposition 0: Asymptotic Optimality
  • Proposition 1: One Safe Project: Optimal Exploitation
  • Corollary 1: One Safe Project: Comparative Statics
  • Proposition 2: Optimal Exploration in Balanced News Settings
  • Corollary 2: No Exploration Index
  • Proposition 3: Optimal Exploration in Good News Settings
  • Claim 1: Uniqueness of Optimal Policy
  • Claim 2: Initial Choice with Pure Good News
  • Proposition 4: Optimal Exploration in Bad News Settings
  • Proof of Proposition 0
  • ...and 7 more