Neural Value Iteration

Yang You, Ufuk Çakır, Alex Schutz, Robert Skilton, Nick Hawes

TL;DR

This work tackles the scalability barrier of offline POMDP planning by exploiting the PWLC structure to represent the value function as a finite set of neural networks, each encoding an α-vector. It introduces the Finite Network Controller (FNC) and Neural Value Iteration (NVI), which perform Bellman backups on neural α-vectors, enabling high-performance planning in domains with hundreds of millions of states (e.g., RockSample$(20,20)$). Empirically, NVI matches or outperforms existing offline methods on challenging benchmarks, offering near-optimal policies with compact representations and avoiding full belief-space discretization. The results suggest a viable path toward deep offline POMDP planning that scales to real-world, large-scale problems while preserving theoretical value-iteration foundations.

Abstract

The value function of a POMDP exhibits the piecewise-linear-convex (PWLC) property and can be represented as a finite set of hyperplanes, known as $\alpha$-vectors. Most state-of-the-art POMDP solvers (offline planners) follow the point-based value iteration scheme, which performs Bellman backups on $\alpha$-vectors at reachable belief points until convergence. However, since each $\alpha$-vector is $|S|$-dimensional, these methods quickly become intractable for large-scale problems due to the prohibitive computational cost of Bellman backups. In this work, we demonstrate that the PWLC property allows a POMDP's value function to be alternatively represented as a finite set of neural networks. This insight enables a novel POMDP planning algorithm called Neural Value Iteration, which combines the generalization capability of neural networks with the classical value iteration framework. Our approach achieves near-optimal solutions even in extremely large POMDPs that are intractable for existing offline solvers.
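
To make the PWLC representation concrete, the following is a minimal sketch (not the authors' code) of how a set of $\alpha$-vectors defines a value function, $V(b) = \max_{\alpha} \sum_s \alpha(s)\, b(s)$, and of one classical point-based Bellman backup at a single belief. The two-state model, the array layouts `T[a][s][s2]`, `Z[a][s2][o]`, `R[a][s]`, and all numbers are assumptions chosen purely for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(alpha: Vector, belief: Vector) -> float:
    """Inner product of an alpha-vector with a belief."""
    return sum(a * b for a, b in zip(alpha, belief))


def value(alphas: Sequence[Vector], belief: Vector) -> float:
    """PWLC value function: V(b) = max over alpha of alpha . b."""
    return max(dot(alpha, belief) for alpha in alphas)


def point_based_backup(alphas, belief, T, Z, R, gamma) -> Tuple[List[float], int]:
    """One Bellman backup at a single belief point.

    Assumed layouts (hypothetical): T[a][s][s2] transition probabilities,
    Z[a][s2][o] observation probabilities, R[a][s] immediate rewards.
    Returns the new alpha-vector and the action it is associated with.
    """
    n_states = len(belief)
    best_alpha, best_action = None, None
    for a in range(len(T)):
        new_alpha = list(R[a])
        for o in range(len(Z[a][0])):
            # Project every existing alpha-vector back through (a, o).
            projected = [
                [
                    sum(T[a][s][s2] * Z[a][s2][o] * alpha[s2] for s2 in range(n_states))
                    for s in range(n_states)
                ]
                for alpha in alphas
            ]
            # Keep the projection that is best at this belief.
            g = max(projected, key=lambda v: dot(v, belief))
            new_alpha = [x + gamma * y for x, y in zip(new_alpha, g)]
        if best_alpha is None or dot(new_alpha, belief) > dot(best_alpha, belief):
            best_alpha, best_action = new_alpha, a
    return best_alpha, best_action


# Toy model (made-up numbers): states never change, observations are uninformative.
STAY = [[1.0, 0.0], [0.0, 1.0]]
T = [STAY, STAY]
Z = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
R = [[1.0, 0.0], [0.0, 1.0]]

new_alpha, action = point_based_backup([[0.0, 0.0]], [0.9, 0.1], T, Z, R, gamma=0.5)
```

The inner sums run over all pairs of states for every action, observation, and existing $\alpha$-vector, which is the cost the abstract identifies as prohibitive when $|S|$ is large.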

Paper Structure

This paper contains 24 sections, 4 equations, 5 figures, 3 tables, 4 algorithms.

Figures (5)

  • Figure 1: A finite-state controller (FSC) vs. a finite-network controller (FNC). Each FNC node stores a neural network that explicitly approximates an $\alpha$-vector.
  • Figure 2: NBB’s upper and lower bounds versus the number of backups on RS(20,20), Light Dark, and Lidar Roomba. Baseline online planning methods [despot, pomcpow, cai2021hyp, adaops] are reported from BetaZero [moss2023betazero].
  • Figure 3: NVI analysis: (a) Effect of $nb_{\text{sample}}$ on RS; (b, c) Comparison with POMCGS in FSC size and planning time on the Light Dark domain. NVI’s FNC is converted to an FSC.
  • Figure 4: Actor-critic architecture, similar to that of [tao2025pobax].
  • Figure 5: Training-progress curves for all evaluated environments.
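
The finite-network controller described in the Figure 1 caption can be pictured with a small sketch. This is an illustration under stated assumptions, not the paper's implementation: the `FNCNode` class, the observation-indexed `edges`, the hand-set one-hidden-unit network, and the particle-average estimate of $\alpha \cdot b$ are all hypothetical choices made here. The point it shows is that a node can hold a function from a state to $\alpha(s)$ instead of an explicit $|S|$-dimensional vector.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence

State = Sequence[float]


def make_tiny_mlp(w1, b1, w2, b2) -> Callable[[State], float]:
    """Build a one-hidden-layer tanh network mapping a state to alpha(s)."""
    def alpha(state: State) -> float:
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, state)) + b)
            for row, b in zip(w1, b1)
        ]
        return sum(w * h for w, h in zip(w2, hidden)) + b2
    return alpha


@dataclass
class FNCNode:
    """Hypothetical controller node: an action plus a network standing in for an alpha-vector."""
    action: int
    alpha: Callable[[State], float]
    edges: Dict[int, int] = field(default_factory=dict)  # observation -> next node index


def node_value(node: FNCNode, particles: List[State]) -> float:
    """Estimate alpha . b by averaging alpha(s) over states sampled from the belief."""
    return sum(node.alpha(s) for s in particles) / len(particles)


def best_node(nodes: List[FNCNode], particles: List[State]) -> int:
    """Index of the node whose neural alpha-vector is highest at this belief."""
    return max(range(len(nodes)), key=lambda i: node_value(nodes[i], particles))


# Two hand-set nodes on a 1-D state: one prefers positive states, one negative.
nodes = [
    FNCNode(action=0, alpha=make_tiny_mlp([[1.0]], [0.0], [1.0], 0.0)),
    FNCNode(action=1, alpha=make_tiny_mlp([[1.0]], [0.0], [-1.0], 0.0)),
]
start = best_node(nodes, [[1.0], [2.0]])
```

With a representation like this, evaluating a belief costs one network call per sampled state, independent of $|S|$; how the networks are trained and how backups produce new nodes is left to the paper's algorithms.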