Theoretical Barriers in Bellman-Based Reinforcement Learning

Brieuc Pinon, Raphaël Jungers, Jean-Charles Delvenne

TL;DR

The paper formalizes fundamental inefficiencies in Bellman-equation–based reinforcement learning when learning over aggregated problem instances, using CNF-SAT as a minimal, structured domain. It proves that learning value functions with the Bellman update (and similarly learning universal value functions for state-to-state reachability via HER) can require exponential time on aggregated subproblems due to noninformative failure feedback. By contrast, a resolution-based SAT solver can exploit aggregation structure without incurring such cost, suggesting that Bellman-based approaches may need richer feedback or alternative decompositional strategies. The results illuminate a theoretical barrier for common RL planning paradigms and motivate developing algorithms that leverage failure information more effectively. Overall, the work highlights how problem aggregation can obscure subproblem structure from Bellman-based learners, with implications for Automated Theorem Proving and related domains.

Abstract

Reinforcement Learning algorithms designed for high-dimensional spaces often enforce the Bellman equation on a sampled subset of states, relying on generalization to propagate knowledge across the state space. In this paper, we identify and formalize a fundamental limitation of this common approach. Specifically, we construct counterexample problems with a simple structure that this approach fails to exploit. Our findings reveal that such algorithms can neglect critical information about the problems, leading to inefficiencies. Furthermore, we extend this negative result to another approach from the literature: Hindsight Experience Replay learning state-to-state reachability.
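To make "enforcing the Bellman equation on a sampled subset of states" concrete, here is a minimal sketch for a CNF-SAT search problem. It rests on assumptions that are not taken from the paper: states are prefix assignments of the variables, the terminal reward is 1 for a satisfying assignment and 0 otherwise, the value function is a plain lookup table defaulting to 0, and all function names are hypothetical. The algorithms analysed in the paper rely on generalization rather than a table.

```python
import itertools

# Clauses use DIMACS-style signed integers: 2 means x2, -2 means NOT x2.
# A state is a tuple of booleans assigning x1, x2, ... in order (a prefix).

def terminal_value(clauses, assignment):
    """Return 1.0 if the full assignment satisfies every clause, else 0.0."""
    def lit_true(lit):
        return assignment[abs(lit) - 1] == (lit > 0)
    return 1.0 if all(any(lit_true(l) for l in c) for c in clauses) else 0.0

def bellman_target(clauses, n, v, state):
    """Right-hand side of the Bellman equation at a prefix assignment."""
    if len(state) == n:
        return terminal_value(clauses, state)
    # Deterministic transitions, no discount: best value among the two children.
    # States missing from the table default to 0.0 (assumed initialization).
    return max(v.get(state + (b,), 0.0) for b in (False, True))

def sampled_bellman_sweep(clauses, n, v, states):
    """Enforce the Bellman equation only on the given (sampled) states."""
    for s in states:
        v[s] = bellman_target(clauses, n, v, s)

def all_states_deepest_first(n):
    """Every prefix assignment, longest first, so one sweep is exact."""
    return [p for length in range(n, -1, -1)
            for p in itertools.product((False, True), repeat=length)]

# (x1 OR x2) AND (NOT x1) over n = 2 variables.
clauses, n = [(1, 2), (-1,)], 2

v_full = {}
sampled_bellman_sweep(clauses, n, v_full, all_states_deepest_first(n))

v_partial = {}
sampled_bellman_sweep(clauses, n, v_partial, [()])  # only the root is sampled
```

With every state visited, the table at the root reflects satisfiability. With only the root sampled, the update sees nothing but default child values, which is the gap that generalization is expected to fill in the high-dimensional setting the abstract describes.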

Paper Structure

This paper contains 19 sections, 5 theorems, 4 equations, 1 figure, 3 algorithms.

Key Result

Theorem 3.6

Let $p$ be a CNF-SAT instance over $n$ variables, constructed by an aggregation of CNF-SAT instances $p_1,\ldots,p_K$ using index lists $I_1,\ldots,I_k,\ldots,I_K$, where the first element of each $I_k$ is $k$. Let $V$ and $V_1,\ldots,V_K$ represent sets of value functions, and let $v^*$ denote an o[…] Under these assumptions, Algorithm alg:BE_search, initialized with $V^0=V$ and $p=p$, runs for an e[…]

Figures (1)

  • Figure 1: Representation of a counterexample problem. Independent sub-problems $p_0,\ldots,p_K$ are aggregated into a single composite problem $p$, where the first variable of each sub-problem is mapped to the start of the new problem. Under appropriate assumptions, Theorem 3.6 states that this construction forces a Bellman equation-based algorithm (Algorithm alg:BE_search) to have an exponential runtime in the number of sub-problems $K$.
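The aggregation shown in Figure 1 can be sketched in a few lines. This is one plausible reading rather than the paper's definition: each index list $I_k$ is taken to map the local variables of $p_k$ to variables of $p$, with its first element equal to $k$ as in Theorem 3.6. The function names, the DIMACS-style clause encoding, and the choice to give all remaining local variables fresh, disjoint global indices are assumptions made for illustration.

```python
def make_index_lists(sizes):
    """Build index lists I_1..I_K for sub-problems with the given variable counts.

    The first element of I_k is k (as stated in Theorem 3.6). The remaining
    local variables get fresh global indices after K (an assumption here,
    chosen so that the sub-problems stay independent).
    """
    K = len(sizes)
    next_free = K + 1
    index_lists = []
    for k, n_k in enumerate(sizes, start=1):
        rest = list(range(next_free, next_free + n_k - 1))
        index_lists.append([k] + rest)
        next_free += n_k - 1
    return index_lists

def aggregate(subproblems, index_lists):
    """Relabel each sub-problem's literals through its index list, then
    concatenate all clauses into one composite CNF instance."""
    clauses = []
    for sub, idx in zip(subproblems, index_lists):
        for clause in sub:
            relabeled = tuple(
                (1 if lit > 0 else -1) * idx[abs(lit) - 1] for lit in clause
            )
            clauses.append(relabeled)
    return clauses

# Two hypothetical sub-problems, each over 2 local variables.
p1 = [(1, 2), (-1,)]   # (x1 OR x2) AND (NOT x1)
p2 = [(-1, -2)]        # (NOT x1 OR NOT x2)

index_lists = make_index_lists([2, 2])
p = aggregate([p1, p2], index_lists)
```

In this sketch the first $K$ variables of the composite problem are the first variables of the sub-problems, so a search that assigns variables in order must commit to one decision per sub-problem before it reaches any sub-problem's remaining variables.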

Theorems & Definitions (19)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4
  • Definition 3.5
  • Theorem 3.6
  • Definition 4.1
  • Definition 4.2
  • Definition 4.3
  • Definition 4.4
  • ...and 9 more