Is Bellman Equation Enough for Learning Control?
Haoxiang You, Lekan Molu, Ian Abraham
TL;DR
This work shows that, in continuous-state control, the Bellman and Hamilton-Jacobi-Bellman (HJB) equations do not uniquely identify the value function. In the continuous-time linear-quadratic regulator (CT-LQR) case, the equation admits at least $\binom{2n}{n}$ solutions, of which only one is stabilizing, and this poses a fundamental challenge for value-based learning. The paper traces the multiple solutions to invariant subspaces of the Hamiltonian matrix associated with the algebraic Riccati equation and shows that most candidate solutions induce unstable closed-loop dynamics. To counter this, the authors propose two strategies: (i) boundary conditions that isolate the stable solution, and (ii) a positive-definite neural architecture that guarantees convergence to the stable solution by construction, effectively enforcing a Lyapunov-like criterion during learning. Experiments on nonlinear tasks such as cart-pole and drone control show that the positive-definite architecture yields stable behavior, whereas standard networks can converge to unstable solutions. The paper also discusses limitations and connections to related work on uniqueness and Lyapunov-based control. Overall, it identifies a practical barrier for value-based RL in continuous state spaces and offers a principled architectural remedy aimed at safer, more reliable learning-based control.
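The link between Riccati solutions and invariant subspaces of the Hamiltonian can be illustrated numerically. The sketch below is not taken from the paper. It assumes the standard CT-LQR conventions (cost $\int (x^\top Q x + u^\top R u)\,dt$, candidate value function $V(x) = x^\top P x$, Riccati equation $A^\top P + PA - PBR^{-1}B^\top P + Q = 0$), and the system matrices are hypothetical values chosen so that all Hamiltonian eigenvalues are real and distinct. Each choice of $n$ eigenvectors $[X_1; X_2]$ of the Hamiltonian with invertible $X_1$ gives a candidate $P = X_2 X_1^{-1}$ whose closed-loop eigenvalues are exactly the selected ones.

```python
from itertools import combinations
from math import comb

import numpy as np

# Hypothetical 2-state example (double integrator); values are illustrative only.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 7.0])
R = np.array([[1.0]])
n = A.shape[0]

S = B @ np.linalg.inv(R) @ B.T
H = np.block([[A, -S], [-Q, -A.T]])  # Hamiltonian matrix, shape (2n, 2n)
eigvals, eigvecs = np.linalg.eig(H)


def riccati_residual(P):
    """Left-hand side of A^T P + P A - P S P + Q = 0."""
    return A.T @ P + P @ A - P @ S @ P + Q


candidates = []
for idx in combinations(range(2 * n), n):
    idx = list(idx)
    X1, X2 = eigvecs[:n, idx], eigvecs[n:, idx]
    if np.linalg.cond(X1) > 1e10:
        continue  # selected subspace is not the graph of a matrix P
    P = X2 @ np.linalg.inv(X1)
    candidates.append({
        "P": P,
        "residual": np.linalg.norm(riccati_residual(P)),
        "symmetric": bool(np.allclose(P, P.T)),
        # The closed-loop matrix A - S P has exactly the selected eigenvalues.
        "stabilizing": bool(np.all(eigvals[idx].real < 0)),
    })

P_stab = next(c["P"] for c in candidates if c["stabilizing"])

print(f"{len(candidates)} candidates out of {comb(2 * n, n)} eigenvector selections")
for c in candidates:
    print(f"residual={c['residual']:.1e}  symmetric={c['symmetric']}  "
          f"stabilizing={c['stabilizing']}")
print("stabilizing P:\n", np.round(np.real(P_stab), 6))
```

In this example every candidate satisfies the matrix Riccati equation to numerical precision, yet only the selection made of the eigenvalues with negative real part yields a stable closed loop. The script also reports whether each candidate is symmetric, since only a symmetric $P$ corresponds directly to a quadratic value function.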
Abstract
The Bellman equation and its continuous-time counterpart, the Hamilton-Jacobi-Bellman (HJB) equation, serve as necessary conditions for optimality in reinforcement learning and optimal control. While the value function is known to be the unique solution to the Bellman equation in tabular settings, we demonstrate that this uniqueness fails to hold in continuous state spaces. Specifically, for linear dynamical systems, we prove that the Bellman equation admits at least $\binom{2n}{n}$ solutions, where $n$ is the state dimension. Crucially, only one of these solutions yields both an optimal policy and a stable closed-loop system. We then demonstrate a common failure mode in value-based methods: convergence to unstable solutions, caused by the exponential imbalance between admissible and inadmissible solutions. Finally, to address this issue, we introduce a positive-definite neural architecture that guarantees convergence to the stable solution by construction.
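The abstract does not specify the positive-definite architecture, so the following is only a minimal sketch of one common way to make a learned value function positive definite by construction; it should not be read as the authors' design. It assumes a small network (with hypothetical sizes and randomly initialized weights standing in for trained parameters) that outputs a lower-triangular matrix $L(x)$, and defines $V(x) = x^\top (L(x) L(x)^\top + \epsilon I)\, x$, so that $V(0) = 0$ and $V(x) \ge \epsilon \|x\|^2 > 0$ for $x \ne 0$ regardless of the weights.

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden, eps = 2, 16, 1e-3      # hypothetical state size, width, and margin
n_tril = n * (n + 1) // 2         # number of lower-triangular entries

# Random weights stand in for trained parameters (assumption for illustration).
W1, b1 = rng.normal(size=(hidden, n)), rng.normal(size=hidden)
W2, b2 = rng.normal(size=(n_tril, hidden)), rng.normal(size=n_tril)
rows, cols = np.tril_indices(n)


def value(x):
    """V(x) = x^T (L(x) L(x)^T + eps * I) x, positive definite for any weights."""
    x = np.asarray(x, dtype=float)
    h = np.tanh(W1 @ x + b1)              # hidden features
    L = np.zeros((n, n))
    L[rows, cols] = W2 @ h + b2           # fill the lower-triangular factor
    M = L @ L.T + eps * np.eye(n)         # symmetric positive-definite matrix
    return float(x @ M @ x)


samples = rng.normal(size=(100, n))
print("V(0) =", value(np.zeros(n)))
print("min V over samples =", min(value(x) for x in samples))
```

Because positivity holds for every parameter setting, gradient-based training of such a model cannot land on a candidate value function that is negative somewhere, which is the kind of Lyapunov-like restriction the abstract alludes to.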
