Exploration is Harder than Prediction: Cryptographically Separating Reinforcement Learning from Supervised Learning

Noah Golowich; Ankur Moitra; Dhruv Rohatgi

TL;DR

The paper establishes the first cryptographic separation between reinforcement learning and supervised learning: in a family of block MDPs, reward-free reinforcement learning can be computationally harder than realizable regression over the same decoding class, under a refined Learning Parities with Noise (LPN) assumption. The approach bridges static LPN hardness and dynamic RL via a counter-MDP whose carefully designed emissions encode latent information yet resist efficient regression unless nontrivial structure is exploited. A key technical ingredient is batch LPN with weakly dependent noise, which preserves hardness despite trajectory dependencies, together with an additive-homomorphic framework that enables trajectory simulation without leaking latent labels. The results yield oracle lower bounds showing that standard regression oracles are insufficient for reward-free RL in these constructions, and they delineate the limits of minimal oracles for RL in block MDPs. The work also clarifies when RL might be tractable by showing where standard regression-based reductions do apply (offline RL, horizon-1 block MDPs, deterministic dynamics), and it outlines several open directions for tightening the separation and extending it to reward-directed RL.
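For readers unfamiliar with LPN, the following is a minimal sketch of how samples are drawn in the standard (textbook) LPN problem: a hidden secret `s` over GF(2), uniformly random vectors `a`, and labels equal to the parity of `a` against `s`, flipped independently with some noise rate. It is not the refined or batch variant the paper relies on, whose exact form is not given in this overview. The function names, the secret length of 16, and the noise rate of 0.25 are illustrative assumptions.

```python
import random


def parity(a, secret):
    """Inner product of two bit vectors over GF(2)."""
    return sum(ai & si for ai, si in zip(a, secret)) % 2


def lpn_samples(secret, num_samples, noise_rate, rng):
    """Draw standard LPN samples (a, <a, secret> XOR e), with e ~ Bernoulli(noise_rate).

    This is the textbook formulation with independent noise; the paper's
    batch variant with weakly dependent noise is NOT modeled here.
    """
    n = len(secret)
    samples = []
    for _ in range(num_samples):
        a = [rng.randrange(2) for _ in range(n)]
        e = 1 if rng.random() < noise_rate else 0
        samples.append((a, parity(a, secret) ^ e))
    return samples


def noise_fraction(secret, samples):
    """Fraction of labels that disagree with the true parity.

    Uses the secret, so it is only a sanity check, not a learning algorithm.
    """
    wrong = sum(1 for a, b in samples if parity(a, secret) != b)
    return wrong / len(samples)


rng = random.Random(0)
secret = [rng.randrange(2) for _ in range(16)]   # hypothetical secret length
samples = lpn_samples(secret, 2000, 0.25, rng)   # hypothetical noise rate
print(noise_fraction(secret, samples))
```

The learner's task is to recover `secret` (or distinguish the labels from uniform bits) from `samples` alone. Without noise this is Gaussian elimination; with noise it is conjectured to be computationally hard, and that conjecture is the kind of assumption the separation builds on.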

Abstract

Supervised learning is often computationally easy in practice. But to what extent does this mean that other modes of learning, such as reinforcement learning (RL), ought to be computationally easy by extension? In this work we show the first cryptographic separation between RL and supervised learning, by exhibiting a class of block MDPs and associated decoding functions where reward-free exploration is provably computationally harder than the associated regression problem. We also show that there is no computationally efficient algorithm for reward-directed RL in block MDPs, even when given access to an oracle for this regression problem. It is known that being able to perform regression in block MDPs is necessary for finding a good policy; our results suggest that it is not sufficient. Our separation lower bound uses a new robustness property of the Learning Parities with Noise (LPN) hardness assumption, which is crucial in handling the dependent nature of RL data. We argue that separations and oracle lower bounds, such as ours, are a more meaningful way to prove hardness of learning because the constructions better reflect the practical reality that supervised learning by itself is often not the computational bottleneck.
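To make the block MDP vocabulary concrete, here is a toy sketch under simplifying assumptions: a small latent state space, observations emitted from the current latent state with disjoint supports, and a decoding function that maps each observation back to its latent state. The class name `ToyBlockMDP`, the transition rule, and the trivially decodable emissions are all hypothetical choices for illustration. In the paper's constructions the emissions are designed so that decoding is computationally nontrivial, which this toy does not attempt to reproduce.

```python
import random


class ToyBlockMDP:
    """Toy block MDP with latent states 0..S-1 and actions 0..A-1.

    Observations are (latent_state, junk) pairs, so the emission supports
    of distinct latent states are disjoint (the defining block property).
    """

    def __init__(self, num_states, num_actions, horizon, seed=0):
        self.S, self.A, self.H = num_states, num_actions, horizon
        self.rng = random.Random(seed)
        # Deterministic latent dynamics, chosen arbitrarily for illustration.
        self.next_state = [
            [(s + a + 1) % num_states for a in range(num_actions)]
            for s in range(num_states)
        ]

    def emit(self, state):
        """Emit an observation whose first coordinate reveals the latent state."""
        return (state, self.rng.randrange(1000))

    def rollout(self, policy):
        """Run one episode; the policy sees only (step, observation)."""
        state = 0
        trajectory = []
        for h in range(self.H):
            obs = self.emit(state)
            action = policy(h, obs)
            trajectory.append((obs, action))
            state = self.next_state[state][action]
        return trajectory


def decode(obs):
    """Decoding function for the toy emissions: read off the latent state."""
    return obs[0]


mdp = ToyBlockMDP(num_states=3, num_actions=2, horizon=5)
episode = mdp.rollout(lambda h, obs: 0)  # always take action 0
print([decode(obs) for obs, _ in episode])
```

In this vocabulary, regression asks to predict a label that depends on the decoded latent state from given observations, while reward-free RL must additionally choose actions that reach the latent states in the first place. The abstract's claim is that the second task can be strictly harder.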

Paper Structure (108 sections, 62 theorems, 293 equations, 2 figures, 1 table, 4 algorithms)

Introduction
Background: Supervised learning and reinforcement learning
Supervised learning.
Episodic reinforcement learning (RL).
Is RL in block MDPs harder than regression?
Discussion: exploration versus prediction.
Main result: a cryptographic separation
Our cryptographic toolbox.
Oracle-efficiency in reinforcement learning
An oracle lower bound.
So is there a minimal oracle?
Discussion: what makes RL tractable?
Technical overview
Proof techniques I: a warm-up separation
A horizon-two block MDP.
...and 93 more sections

Key Result

Theorem 1.4

Under assm:fine-lpn, for any constant $C>0$, there is a block MDP family $\mathcal{M}$ for which the time complexity of reward-free reinforcement learning (def:rf-rl) is larger than that of $1/(HAS)^C$-accurate $\mathcal{M}$-realizable regression (def:regression-algorithm), by a multiplicative facto

Figures (2)

Figure 1: The horizon-two latent MDP
Figure 2: Diagram of the separation between realizable regression and strong reward-free RL under \ref{assm:fine-lpn}. Here, the inequalities refer to polynomial-time reducibility. The starred inequality is only true for an idealized encryption scheme $(\mathtt{Enc},\mathtt{Dec})$, but can be made rigorous with an explicit LPN-based encryption scheme and a modification of \ref{problem:self-supervised}. See \ref{sec:warmup-overview}.

Theorems & Definitions (158)

Definition 1.1
Definition 1.2: Informal; see \ref{sec:block-rl}
Theorem 1.4: Informal version of \ref{thm:main-separation}
Remark 1.5
Lemma 1.6: Informal statement of \ref{lemma:construct-corr-lpn}
Theorem 1.7: Informal statement of \ref{thm:reduction-prp}
Proposition 2.1
Proof: Proof sketch
Proposition 2.2
Proof: Proof sketch
...and 148 more
