Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning
Noah Golowich, Ankur Moitra, Dhruv Rohatgi
TL;DR
The paper establishes the first cryptographic separation between reinforcement learning and supervised learning: in a family of block MDPs, reward-free RL can be computationally harder than realizable regression over the same decoding class, under a refined Learning Parities with Noise (LPN) assumption. The approach bridges static LPN hardness and dynamic RL via a counter-MDP with carefully designed emissions that encode latent information, yet admit efficient regression only by exploiting nontrivial structure. Two technical ingredients drive the argument: batch LPN with weakly dependent noise, which remains hard despite the dependencies within a trajectory, and an additively homomorphic framework that allows trajectories to be simulated without leaking latent labels. The paper also proves oracle lower bounds showing that access to a standard regression oracle is insufficient for computationally efficient reward-directed RL in block MDPs, which delineates the limits of minimal oracles for RL in this setting. Finally, the work clarifies when RL is tractable by identifying settings where standard regression-based reductions do apply (offline RL, horizon-1 block MDPs, deterministic dynamics), and it outlines open directions for tightening the separation and extending it to reward-directed RL.
Abstract
Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem. It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck.
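For readers unfamiliar with LPN, the following is a minimal sketch of the standard sample distribution: a secret vector s over F_2, uniformly random query vectors a, and labels equal to the parity of a and s, each flipped independently with a fixed noise rate. It illustrates only the textbook assumption with i.i.d. noise, not the refined variant with dependent noise that the separation relies on. The function name `lpn_samples` and all parameter values are hypothetical and chosen for illustration.

```python
import random

def lpn_samples(secret, num_samples, noise_rate, rng):
    """Draw standard LPN samples (a, <a, secret> + e mod 2) with i.i.d. Bernoulli noise.

    Illustrative only: this is the textbook distribution, not the paper's
    refined assumption with weakly dependent noise.
    """
    n = len(secret)
    samples = []
    for _ in range(num_samples):
        # Uniformly random query vector a in {0, 1}^n.
        a = [rng.randrange(2) for _ in range(n)]
        # Noiseless label: inner product of a and the secret over F_2.
        parity = sum(ai & si for ai, si in zip(a, secret)) % 2
        # Flip the label with probability noise_rate.
        e = 1 if rng.random() < noise_rate else 0
        samples.append((a, parity ^ e))
    return samples

rng = random.Random(0)
secret = [rng.randrange(2) for _ in range(16)]   # hypothetical dimension n = 16
samples = lpn_samples(secret, 1000, 0.1, rng)    # hypothetical noise rate 0.1
```

With `noise_rate` set to 0, the secret can be recovered by Gaussian elimination over F_2; the hardness assumption concerns the noisy case, where no efficient recovery algorithm is known.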
