Exploration Unbound
Dilip Arumugam, Wanqiao Xu, Benjamin Van Roy
TL;DR
This work studies sequential decision-making with infinite action spaces and unbounded rewards, and shows that the traditional practice of tapering exploration over time is suboptimal in this setting. The authors formalize a representative complex bandit with a curricular structure and unbounded rewards, and prove that optimal policies must randomize between exploration and exploitation rather than commit to either one. They establish that the optimal mixing probability depends on the horizon, with a conjectured limit $p_T^*\to (\alpha+1)/(\alpha+\tau)$ as $T\to\infty$. A concrete instantiation demonstrates that neither pure exploration nor pure exploitation is ever discounted-overtaking optimal, and that a nonzero exploration probability persists indefinitely. The work further connects to practical agent design via learning targets and rate-distortion Thompson Sampling, suggesting principled mechanisms for managing exploration in real-world, large-scale systems such as LLMs and continual RL settings.
Abstract
A sequential decision-making agent balances between exploring to gain new knowledge about an environment and exploiting current knowledge to maximize immediate reward. For environments studied in the traditional literature, optimal decisions gravitate over time toward exploitation as the agent accumulates sufficient knowledge and the benefits of further exploration vanish. What if, however, the environment offers an unlimited amount of useful knowledge and there is a large benefit to further exploration no matter how much the agent has learned? We offer a simple, quintessential example of such a complex environment. In this environment, rewards are unbounded and an agent can always increase the rate at which rewards accumulate by exploring to learn more. Consequently, an optimal agent forever maintains a propensity to explore.
