Table of Contents
Fetching ...

Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning

David Hudák, Maris F. L. Galesloot, Martin Tappler, Martin Kurečka, Nils Jansen, Milan Češka

TL;DR

The paper tackles scalability and verifiability in POMDP planning by combining deep reinforcement learning with formal policy extraction. It introduces Lexpop, which trains neural policies via DRL and then extracts finite-state controllers (FSCs) that can be formally evaluated, extending the approach to HM-POMDPs for worst-case robustness. Two FSC extraction methods are proposed: automata learning (Alergia) and Self-Interpretable Networks (SIG), together with a verification framework (PAYNT/Storm) to guarantee performance bounds. Experiments on large state-space problems show that Lexpop outperforms state-of-the-art solvers in both standard and robust HM-POMDP settings, demonstrating scalable, verifiable planning in challenging domains.

Abstract

Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.

Finite-State Controllers for (Hidden-Model) POMDPs using Deep Reinforcement Learning

TL;DR

The paper tackles scalability and verifiability in POMDP planning by combining deep reinforcement learning with formal policy extraction. It introduces Lexpop, which trains neural policies via DRL and then extracts finite-state controllers (FSCs) that can be formally evaluated, extending the approach to HM-POMDPs for worst-case robustness. Two FSC extraction methods are proposed: automata learning (Alergia) and Self-Interpretable Networks (SIG), together with a verification framework (PAYNT/Storm) to guarantee performance bounds. Experiments on large state-space problems show that Lexpop outperforms state-of-the-art solvers in both standard and robust HM-POMDP settings, demonstrating scalable, verifiable planning in challenging domains.

Abstract

Solving partially observable Markov decision processes (POMDPs) requires computing policies under imperfect state information. Despite recent advances, the scalability of existing POMDP solvers remains limited. Moreover, many settings require a policy that is robust across multiple POMDPs, further aggravating the scalability issue. We propose the Lexpop framework for POMDP solving. Lexpop (1) employs deep reinforcement learning to train a neural policy, represented by a recurrent neural network, and (2) constructs a finite-state controller mimicking the neural policy through efficient extraction methods. Crucially, unlike neural policies, such controllers can be formally evaluated, providing performance guarantees. We extend Lexpop to compute robust policies for hidden-model POMDPs (HM-POMDPs), which describe finite sets of POMDPs. We associate every extracted controller with its worst-case POMDP. Using a set of such POMDPs, we iteratively train a robust neural policy and consequently extract a robust controller. Our experiments show that on problems with large state spaces, Lexpop outperforms state-of-the-art solvers for POMDPs as well as HM-POMDPs.
Paper Structure (58 sections, 7 equations, 7 figures, 10 tables, 1 algorithm)

This paper contains 58 sections, 7 equations, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: The overall idea of Lexpop, our RL-based FSC extraction. The robust extensions are depicted in green.
  • Figure 2: High-level architecture of the self-interpretable Gumbel softmax network.
  • Figure 3: Selected convergence curves showing average values and IQR (if applicable): (a) Large single-POMDP benchmarks: Empirical probabilities of the neural policies (dashed lines), and the values of the FSC produced by Saynt (crosses) and extracted from the final neural policy (solid lines). (b,c) Performance loss of the FSC extraction: Empirical value of the neural policy on the set of POMDPs used for training (dashed lines) and the robust value of the extracted FSCs in the whole HM-POMDPs (solid lines). Additionally, in (c), the orange line shows the empirical value of the extracted FSCs on the set of POMDPs used for training, and the black line shows the empirical value achieved by the same self-interpretable network without the quantized layers on the same set of POMDPs. (d) Impact of the worst-case POMDP selection in HM-POMDPs: Lines as in (b). The green lines show the results using the worst-case POMDP selection, while the red lines show the results for random selection. Additional convergence curves are in \ref{['appendix:extended_results']}.
  • Figure 4: The architecture of our Gumbel Softmax self-interpretable network. The upper part of each block describes the architecture of a layer, the second number represents the size of the layer, where $k$ is the upper bound of memory, and $\text{acts}$ is the number of playable actions for a given model. During the training, we use the output distribution from the Gumbel layer, while in the case of the final controller construction, we use a one-hot vector representing the sampled memory.
  • Figure 5: Figures for single-POMDP models. The yellow line shows the RL performance over time, red crosses the values provided by Saynt over time, the blue line the final verified value extracted by SIG, and the green line the verified value extracted by Alergia.
  • ...and 2 more figures

Theorems & Definitions (5)

  • Definition 1: MDP
  • Definition 2: POMDP
  • Definition 3: FSC
  • Definition 4: HM-POMDP
  • Definition 5: Robust value