Necessary and Sufficient Oracles: Toward a Computational Taxonomy For Reinforcement Learning
Dhruv Rohatgi, Dylan J. Foster
TL;DR
This work investigates which supervised-learning oracles are needed for oracle-efficient reinforcement learning in rich-observation settings. It establishes two-context regression as a minimal oracle, both necessary and sufficient, for reward-free episodic RL in Block MDPs under a mild regularity condition, and shows that reset access weakens the required oracle to one-context regression, which is near-minimal in that access model. It further proves a qualitative computational separation between Block MDPs and Generalized/Low-Rank MDPs via cryptographic hardness, arguing that weight function realizability plays a key role. The results are instantiated with concrete reductions and algorithms (PCE, EPCE, PCR, EPCR, NoiselessNoRed, TwoRed, TwoAug) that connect regression oracles to policy-cover constructions across horizon layers. Overall, the paper contributes a computational taxonomy for RL by clarifying how different oracle assumptions affect the tractability of exploration in structured RL models, and it highlights directions for future theory and algorithm design. The findings bear on the design of principled RL systems that rely on supervised-learning subroutines in large-state environments.
Abstract
Algorithms for reinforcement learning (RL) in large state spaces crucially rely on supervised learning subroutines to estimate objects such as value functions or transition probabilities. Since only the simplest supervised learning problems can be solved provably and efficiently, practical performance of an RL algorithm depends on which of these supervised learning "oracles" it assumes access to (and how they are implemented). But which oracles are better or worse? Is there a minimal oracle? In this work, we clarify the impact of the choice of supervised learning oracle on the computational complexity of RL, as quantified by the oracle strength. First, for the task of reward-free exploration in Block MDPs in the standard episodic access model -- a ubiquitous setting for RL with function approximation -- we identify two-context regression as a minimal oracle, i.e. an oracle that is both necessary and sufficient (under a mild regularity assumption). Second, we identify one-context regression as a near-minimal oracle in the stronger reset access model, establishing a provable computational benefit of resets in the process. Third, we broaden our focus to Low-Rank MDPs, where we give cryptographic evidence that the analogous oracle from the Block MDP setting is insufficient.
