Is Bellman Equation Enough for Learning Control?

Haoxiang You, Lekan Molu, Ian Abraham

TL;DR

This work shows that in continuous-state control the Bellman/HJB equations do not uniquely identify the value function. In the linear CT-LQR case the equation admits at least $\binom{2n}{n}$ solutions, only one of which is stabilizing, which creates a fundamental challenge for value-based learning. The paper links the multiple solutions to invariant subspaces of the Hamiltonian and to the algebraic Riccati equation, showing that most candidate solutions induce unstable closed-loop dynamics. To counter this, the authors propose two strategies: (i) boundary-condition tricks to isolate the stable solution, and (ii) a positive-definite neural architecture that guarantees convergence to the stable solution by construction, effectively enforcing a Lyapunov-like criterion during learning. They demonstrate the approach on nonlinear tasks such as cart-pole and drone control, where the positive-definite architecture yields stable behavior while standard networks can converge to unstable solutions, and they discuss limitations and connections to related work on uniqueness and Lyapunov-based control. The work highlights a practical barrier for value-based RL in continuous spaces and provides a principled architectural remedy with implications for safer, more reliable learning-based control.

Abstract

The Bellman equation and its continuous-time counterpart, the Hamilton-Jacobi-Bellman (HJB) equation, serve as necessary conditions for optimality in reinforcement learning and optimal control. While the value function is known to be the unique solution to the Bellman equation in tabular settings, we demonstrate that this uniqueness fails to hold in continuous state spaces. Specifically, for linear dynamical systems, we prove the Bellman equation admits at least $\binom{2n}{n}$ solutions, where $n$ is the state dimension. Crucially, only one of these solutions yields both an optimal policy and a stable closed-loop system. We then demonstrate a common failure mode in value-based methods: convergence to unstable solutions due to the exponential imbalance between admissible and inadmissible solutions. Finally, we introduce a positive-definite neural architecture that guarantees convergence to the stable solution by construction to address this issue.
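The non-uniqueness is easiest to see in the scalar case ($n = 1$), where the count $\binom{2n}{n}$ equals 2. The sketch below is an illustration rather than the authors' code. It assumes scalar dynamics $\dot{x} = a x + b u$, running cost $q x^2 + r u^2$, and a quadratic candidate $\mathcal{V}(x) = p x^2$, for which the HJB equation reduces to $q + 2 a p - (b^2 / r) p^2 = 0$. The function names and numerical values are hypothetical.

```python
import math


def scalar_hjb_solutions(a, b, q, r):
    """Both roots p of the scalar HJB/Riccati equation q + 2*a*p - (b**2/r)*p**2 = 0."""
    disc = math.sqrt(a * a + b * b * q / r)
    return [r * (a + sign * disc) / (b * b) for sign in (1.0, -1.0)]


def hjb_residual(a, b, q, r, p):
    """Left-hand side of the scalar equation; zero for an exact solution."""
    return q + 2.0 * a * p - (b * b / r) * p * p


def closed_loop_pole(a, b, r, p):
    """Pole of x_dot = (a - b**2 * p / r) * x under the policy u = -(b * p / r) * x."""
    return a - b * b * p / r


# Assumed example system: open-loop unstable, unit cost weights.
a, b, q, r = 1.0, 1.0, 1.0, 1.0
solutions = scalar_hjb_solutions(a, b, q, r)

for p in solutions:
    pole = closed_loop_pole(a, b, r, p)
    label = "stable" if pole < 0 else "unstable"
    print(f"p = {p:+.4f}, residual = {hjb_residual(a, b, q, r, p):.1e}, "
          f"closed-loop pole = {pole:+.4f} ({label})")

# Only solutions whose closed-loop pole is negative give a stabilizing policy.
stabilizing = [p for p in solutions if closed_loop_pole(a, b, r, p) < 0]

# The lower bound on the number of solutions grows quickly with state dimension.
solution_counts = {n: math.comb(2 * n, n) for n in (1, 2, 4, 8)}
print(solution_counts)
```

Both roots drive the residual to zero, so a learner that only minimizes the Bellman residual has no way to tell them apart; only the closed-loop pole distinguishes the stabilizing root from the other one.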

Paper Structure

This paper contains 37 sections, 16 theorems, 92 equations, 7 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

The value function $\mathcal{V}(\mathbf{x}) = \mathbf{x}^\top P \mathbf{x}$ satisfies the HJB equation for the discounted LQR problem (eq: discounted LQR) if and only if it satisfies the HJB equation for the undiscounted LQR problem with modified dynamics (eq: modified LQR).
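A short derivation indicates why such an equivalence can hold. The referenced equations are not reproduced on this page, so the following is a sketch under stated assumptions: a discount rate $\rho > 0$ entering the cost as $e^{-\rho t}$, running cost $\mathbf{x}^\top Q \mathbf{x} + \mathbf{u}^\top R \mathbf{u}$ with $R \succ 0$, linear dynamics $\dot{\mathbf{x}} = A\mathbf{x} + B\mathbf{u}$, and symmetric $P$. Under these assumptions the modified dynamics would be $A - \tfrac{\rho}{2} I$; the paper's exact form may differ.

```latex
\begin{align*}
\rho\, \mathbf{x}^\top P \mathbf{x}
  &= \min_{\mathbf{u}} \Big[ \mathbf{x}^\top Q \mathbf{x} + \mathbf{u}^\top R \mathbf{u}
     + 2\, \mathbf{x}^\top P \,(A \mathbf{x} + B \mathbf{u}) \Big]
  && \text{(discounted HJB with } \mathcal{V} = \mathbf{x}^\top P \mathbf{x}\text{)} \\
\mathbf{u}^\star &= -R^{-1} B^\top P \mathbf{x}
  && \text{(minimizer)} \\
\rho P &= Q + A^\top P + P A - P B R^{-1} B^\top P
  && \text{(substitute } \mathbf{u}^\star\text{)} \\
0 &= Q + \big(A - \tfrac{\rho}{2} I\big)^{\top} P + P \big(A - \tfrac{\rho}{2} I\big)
     - P B R^{-1} B^\top P
  && \text{(absorb } \rho P \text{ into the dynamics)}
\end{align*}
```

The last line is the undiscounted algebraic Riccati equation with $A$ replaced by $A - \tfrac{\rho}{2} I$, and each step is reversible, which gives the "if and only if".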

Figures (7)

  • Figure 1: Solution to LQR: $\mathcal{V}(\mathbf{x}) = \mathbf{x}^\top P \mathbf{x}$, where $P$ is given by (eq: toy sol). The green dot indicates the stable solution, the blue dot marks the unstable solution, and the orange ring represents additional solutions arising from the noninvertibility of $P_1$.
  • Figure 2: Learned solution to LQR. Left figure: the learned value function converges to the unstable solution, up to an additive constant. Right figure: both the learned and analytical unstable solutions yield identical diverging trajectories, whereas the stable solution converges to the origin.
  • Figure 3: Example of insufficient boundary conditions: both solutions share the same boundary value, yet one yields a stable closed-loop system while the other results in instability.
  • Figure 4: Value learning with an MLP and with the positive-definite architecture. For each case, we run the experiments with 5 different seeds and report the average performance. The shaded area represents the standard deviation across the runs. The TD error is minimized for both architectures, indicating that a solution to the Bellman equation is found. However, the generic MLP architecture cannot distinguish the stable solution from the others, leading to high cumulative cost and diverging behavior.
  • Figure 5: Solutions found with different initialization methods.
  • ...and 2 more figures
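
The contrast in Figure 4 between a generic MLP and a positive-definite architecture can be made concrete with a small sketch. The paper's exact architecture is not given on this page, so the construction below is an assumption: it uses one common way to make a network positive definite by construction, $\mathcal{V}(\mathbf{x}) = \lVert \phi(\mathbf{x}) - \phi(\mathbf{0}) \rVert^2 + \epsilon \lVert \mathbf{x} \rVert^2$, where $\phi$ is an unconstrained MLP. All names (`make_mlp`, `mlp_forward`, `pd_value`) and the value of `eps` are hypothetical.

```python
import math
import random


def make_mlp(sizes, seed=0):
    """Random weights and biases for a small fully connected network (illustrative only)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [rng.gauss(0.0, 0.1) for _ in range(n_out)]
        layers.append((weights, biases))
    return layers


def mlp_forward(layers, x):
    """Unconstrained feature map phi(x) with tanh hidden activations."""
    h = list(x)
    for i, (weights, biases) in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:
            h = [math.tanh(v) for v in h]
    return h


def pd_value(layers, x, eps=1e-3):
    """V(x) = ||phi(x) - phi(0)||^2 + eps * ||x||^2.

    V(0) = 0 and V(x) >= eps * ||x||^2 > 0 for x != 0, whatever the weights are.
    """
    phi_x = mlp_forward(layers, x)
    phi_0 = mlp_forward(layers, [0.0] * len(x))
    feature_term = sum((u - v) ** 2 for u, v in zip(phi_x, phi_0))
    return feature_term + eps * sum(v * v for v in x)


layers = make_mlp([2, 16, 16, 4], seed=0)
print(pd_value(layers, [0.0, 0.0]))
print(pd_value(layers, [0.5, -1.0]))
```

Because positivity holds for every weight setting, gradient-based training can only move between positive-definite candidates, which rules out sign-indefinite solutions of the Bellman equation without an extra penalty term.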

Theorems & Definitions (39)

  • Theorem 1
  • proof
  • Definition 1
  • Theorem 2
  • proof
  • Lemma 3
  • proof
  • Theorem 4
  • proof
  • Theorem 5
  • ...and 29 more