Disentangling Exploration from Exploitation

Alessandro Lizzeri; Eran Shmaya; Leeat Yariv

TL;DR

This paper studies how to disentangle exploration from exploitation in a two-project Poisson-bandit framework, in which learning about each project occurs via Poisson signals with rates $\lambda_z^g$ and $\lambda_z^b$ and payoffs depend on each project's success. When one project is safe, the optimal policy is threshold-based, with critical cutoff $\bar{p}(\alpha)=\dfrac{(r+\lambda(1-\alpha))R_L}{(r+\lambda)R_H-\lambda \alpha R_L}$, and disentanglement ($\alpha<1$) yields higher payoffs, especially at intermediate values of the discount rate $r$ and the news arrival rate $\lambda$. For two risky projects, there is no Gittins-like index governing exploration; optimal exploration depends on a project-specific information value and leads to persistence and limited switching. The results differ qualitatively from those for classical entangled bandits: the disentangled framework offers payoff gains in intermediate-parameter regimes and across good, bad, and balanced news structures. The framework is relevant for policy evaluation, portfolio exploration, and dynamic information acquisition where exploration targets are not immediately pursued.
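To make the cutoff formula concrete, the following minimal sketch evaluates $\bar{p}(\alpha)$ at a few values of $\alpha$. It is only an illustration of the expression quoted above, not the authors' code: the function name `cutoff` is hypothetical, $R_L=10$ and $R_H=15$ are borrowed from the Figure 1 caption, and $r=1$ and $\lambda=5$ are assumed values chosen for illustration.

```python
def cutoff(alpha, r, lam, R_L, R_H):
    """Evaluate the cutoff p_bar(alpha) from the TL;DR formula.

    p_bar(alpha) = (r + lam*(1 - alpha)) * R_L / ((r + lam)*R_H - lam*alpha*R_L)
    """
    numerator = (r + lam * (1.0 - alpha)) * R_L
    denominator = (r + lam) * R_H - lam * alpha * R_L
    return numerator / denominator


# Illustrative parameters: R_L and R_H follow the Figure 1 caption;
# r and lam are assumptions made for this sketch only.
R_L, R_H = 10.0, 15.0
r, lam = 1.0, 5.0

for alpha in (0.0, 0.5, 1.0):
    print(f"alpha={alpha:.1f}  cutoff={cutoff(alpha, r, lam, R_L, R_H):.4f}")
```

Algebraically, the formula reduces to $R_L/R_H$ at $\alpha=0$ and to $rR_L/\big((r+\lambda)R_H-\lambda R_L\big)$ at $\alpha=1$. With the assumed parameters these endpoints are about $0.667$ and $0.25$, so the cutoff rises as $\alpha$ falls.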

Abstract

Starting from Robbins (1952), the literature on experimentation via multi-armed bandits has wed exploration and exploitation. Nonetheless, in many applications, agents' exploration and exploitation need not be intertwined: a policymaker may assess new policies different from the status quo; an investor may evaluate projects outside her portfolio. We characterize the optimal experimentation policy when exploration and exploitation are disentangled in the case of Poisson bandits, allowing for general news structures. The optimal policy features complete learning asymptotically, exhibits lots of persistence, but cannot be identified by an index à la Gittins. Disentanglement is particularly valuable for intermediate parameter values.

Paper Structure (16 sections, 7 theorems, 31 equations, 3 figures)

Introduction
Related Literature
The Model
One Safe Project
Optimal Policy with a Safe Project
Payoff Consequences of Disentanglement
Two Risky Projects
Balanced News Settings
No Exploration Index
Good News Settings
Bad News Settings
Concluding Remarks
Appendix
Preliminaries
One Safe Project: Proofs and Additional Analysis
...and 1 more section

Key Result

Proposition 0

For all $\alpha<1$, the agent exploits the best project asymptotically.

Figures (3)

Figure 1: Payoff value of disentanglement for (a) pure good news settings, and (b) pure bad news settings when $R_{L}=10$, $R_{H}=15$, and $\lambda_{H}=5$
Figure 2: Optimal policy with two risky projects in good news settings
Figure 3: Optimal policy with two risky projects in bad news settings

Theorems & Definitions (17)

Proposition 0: Asymptotic Optimality
Proposition 1: One Safe Project: Optimal Exploitation
Corollary 1: One Safe Project: Comparative Statics
Proposition 2: Optimal Exploration in Balanced News Settings
Corollary 2: No Exploration Index
Proposition 3: Optimal Exploration in Good News Settings
Claim 1: Uniqueness of Optimal Policy
Claim 2: Initial Choice with Pure Good News
Proposition 4: Optimal Exploration in Bad News Settings
Proof: Proof of Proposition 0
...and 7 more
