Theoretical Barriers in Bellman-Based Reinforcement Learning
Brieuc Pinon, Raphaël Jungers, Jean-Charles Delvenne
TL;DR
The paper formalizes a fundamental inefficiency of Bellman-equation-based reinforcement learning on aggregated problem instances, using CNF-SAT as a minimal, structured domain. It proves that learning value functions with the Bellman update can require exponential time on aggregated subproblems because failure feedback is noninformative, and that a similar result holds when learning universal value functions for state-to-state reachability with Hindsight Experience Replay (HER). By contrast, a resolution-based SAT solver can exploit the aggregation structure without incurring this cost, which suggests that Bellman-based approaches may need richer feedback or alternative decomposition strategies. Because aggregation can hide subproblem structure from Bellman-based learners, the results motivate algorithms that use failure information more effectively, with implications for Automated Theorem Proving and related domains.
Abstract
Reinforcement Learning algorithms designed for high-dimensional spaces often enforce the Bellman equation on a sampled subset of states, relying on generalization to propagate knowledge across the state space. In this paper, we identify and formalize a fundamental limitation of this common approach. Specifically, we construct counterexample problems with a simple structure that this approach fails to exploit. Our findings reveal that such algorithms can neglect critical information about the problems, leading to inefficiencies. Furthermore, we extend this negative result to another approach from the literature: Hindsight Experience Replay learning state-to-state reachability.
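To make "enforcing the Bellman equation" concrete, here is a minimal sketch on a toy CNF-SAT assignment problem. It is not the paper's construction. The encoding is an assumption made for illustration: states are partial assignments, each action fixes the next variable, and the reward is 1 only for a complete satisfying assignment. The names satisfies and bellman_values are hypothetical. The table is exhaustive rather than sampled, so the sketch shows only the Bellman backup, not generalization across states.

```python
from itertools import product

# A CNF formula is a list of clauses; a clause is a list of non-zero ints.
# Literal k means variable |k| must be True if k > 0, False if k < 0.
def satisfies(assignment, cnf):
    """Return True if the complete assignment satisfies every clause."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in cnf
    )

def bellman_values(cnf, n_vars, sweeps=None):
    """Tabular values over partial assignments (tuples of 0..n_vars bools).

    Assumed toy MDP: terminal states get reward 1 if they satisfy cnf,
    otherwise 0; non-terminal states back up the max over both actions.
    """
    if sweeps is None:
        sweeps = n_vars + 1  # enough sweeps for values to reach the root
    states = [s for k in range(n_vars + 1)
              for s in product((False, True), repeat=k)]
    V = {s: 0.0 for s in states}
    for _ in range(sweeps):
        for s in states:
            if len(s) == n_vars:
                V[s] = 1.0 if satisfies(s, cnf) else 0.0
            else:
                # Bellman backup: V(s) = max over the two next assignments.
                V[s] = max(V[s + (False,)], V[s + (True,)])
    return V

# A satisfiable formula: (x1 or x2) and (not x1 or x2).
V_sat = bellman_values([[1, 2], [-1, 2]], n_vars=2)

# An aggregate of two sub-formulas on disjoint variables:
# an unsatisfiable one, (x1) and (not x1), joined with (x2 or x3).
V_agg = bellman_values([[1], [-1], [2, 3]], n_vars=3)
```

In this toy aggregate every entry of V_agg is 0, so the value table alone does not indicate which sub-formula caused the failure. This is only an illustration of uninformative failure feedback under the assumed encoding.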
