Exploration Unbound

Dilip Arumugam; Wanqiao Xu; Benjamin Van Roy

TL;DR

This work tackles sequential decision-making under the challenge of infinite action spaces and unbounded rewards, showing that traditional tapering of exploration is suboptimal. By formalizing a representative complex bandit with a curricular structure and unbounded rewards, the authors prove that optimal policies must randomize between exploration and exploitation rather than fixating on one strategy, and they establish horizon-dependent optimal mixing with a conjectured limit $p_T^*\to (\alpha+1)/(\alpha+\tau)$ as $T\to\infty$. A concrete instantiation demonstrates that pure exploration or pure exploitation is never discounted-overtaking optimal, and a nonzero exploration probability persists indefinitely. The work further connects to practical agent design via learning targets and rate-distortion Thompson Sampling, suggesting principled mechanisms to manage exploration in real-world, large-scale systems such as LLMs and continual RL settings.

Abstract

A sequential decision-making agent balances between exploring to gain new knowledge about an environment and exploiting current knowledge to maximize immediate reward. For environments studied in the traditional literature, optimal decisions gravitate over time toward exploitation as the agent accumulates sufficient knowledge and the benefits of further exploration vanish. What if, however, the environment offers an unlimited amount of useful knowledge and there is large benefit to further exploration no matter how much the agent has learned? We offer a simple, quintessential example of such a complex environment. In this environment, rewards are unbounded and an agent can always increase the rate at which rewards accumulate by exploring to learn more. Consequently, an optimal agent forever maintains a propensity to explore.

Paper Structure (13 sections, 3 theorems, 28 equations, 1 figure)

Introduction
Problem Formulation
Necessity of Randomized Exploration
Discussion
Prior Work
Towards Practical Agent Design
Conclusion
Computational Experiment
Analysis
Proof of Theorem \ref{thm:exploit}
Proof of Theorem \ref{thm:explore}
Proof of Theorem \ref{thm:regret}
Proof Roadmap for Conjecture \ref{conj:convergence}

Key Result

Theorem 1

In Example 1, an agent that always explores is never discounted-overtaking optimal.

Figures (1)

Figure 1: Cumulative regret curve comparing Thompson Sampling and Rate-Distortion Thompson Sampling agents for learning the first two digits of $\pi$.

Theorems & Definitions (9)

Example 1
Theorem 1
Theorem 2
Theorem 3
Conjecture 1
Proof
Proof
Proof
Proof (Proof roadmap)
