Rao-Blackwellized POMDP Planning

Jiho Lee; Nisar R. Ahmed; Kyle H. Wray; Zachary N. Sunberg

TL;DR

This paper introduces RB-POMDP, a framework for planning under uncertainty that uses Rao-Blackwellized Particle Filters (RBPFs) to handle analytically tractable state components in closed form while sampling the rest, which reduces both the number of particles required and the variance of the belief estimate. It pairs RBPFs with a new online planner, RB-POMCPOW, which uses quadrature-based integration (e.g., Gauss-Hermite rules, Smolyak sparse grids) to compute expectations over the marginalized states and therefore needs fewer Monte Carlo tree iterations. In a GPS-denied localization task, an RBPF with fewer particles can achieve a higher effective sample size (ESS) and comparable or better cumulative reward than a bootstrap particle filter (SIRPF), while RB-POMCPOW with moderate quadrature levels speeds up planning roughly sevenfold relative to standard POMCPOW under the same time budget. The findings suggest that RB-POMDP solvers offer scalable, efficient decision-making for high-dimensional POMDPs and complex planning problems.
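To illustrate why quadrature can replace many Monte Carlo samples when a state component is Gaussian, here is a minimal sketch, not the paper's implementation. It uses the standard three-point Gauss-Hermite rule (probabilists' form) to approximate an expectation under a one-dimensional Gaussian. The function names and the test integrand are hypothetical, and the multi-dimensional Smolyak sparse grids mentioned above are not reproduced here.

```python
import math
import random

# Three-point Gauss-Hermite rule in probabilists' form, normalized so the
# weights sum to 1 for a standard normal. Exact for polynomials up to degree 5.
GH3_NODES = (-math.sqrt(3.0), 0.0, math.sqrt(3.0))
GH3_WEIGHTS = (1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0)


def gauss_hermite_expectation(f, mean, std):
    """Approximate E[f(X)] for X ~ N(mean, std^2) using three evaluations of f."""
    return sum(w * f(mean + std * z) for z, w in zip(GH3_NODES, GH3_WEIGHTS))


def monte_carlo_expectation(f, mean, std, n, rng):
    """Baseline: approximate the same expectation with n random samples."""
    return sum(f(rng.gauss(mean, std)) for _ in range(n)) / n


if __name__ == "__main__":
    rng = random.Random(0)
    f = lambda x: x ** 2  # hypothetical integrand; E[X^2] = mean^2 + std^2
    print("quadrature :", gauss_hermite_expectation(f, 1.0, 2.0))
    print("monte carlo:", monte_carlo_expectation(f, 1.0, 2.0, 1000, rng))
```

For this integrand the quadrature estimate matches the analytic value mean^2 + std^2 up to floating-point error with only three function evaluations, whereas the Monte Carlo estimate still carries sampling noise after 1000 draws.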

Abstract

Partially Observable Markov Decision Processes (POMDPs) provide a structured framework for decision-making under uncertainty, but their application requires efficient belief updates. Sequential Importance Resampling Particle Filters (SIRPF), also known as Bootstrap Particle Filters, are commonly used as belief updaters in large approximate POMDP solvers, but they face challenges such as particle deprivation and high computational costs as the system's state dimension grows. To address these issues, this study introduces Rao-Blackwellized POMDP (RB-POMDP) approximate solvers and outlines generic methods to apply Rao-Blackwellization in both belief updates and online planning. We compare the performance of SIRPF and Rao-Blackwellized Particle Filters (RBPF) in a simulated localization problem where an agent navigates toward a target in a GPS-denied environment using POMCPOW and RB-POMCPOW planners. Our results not only confirm that RBPFs maintain accurate belief approximations over time with fewer particles, but, more surprisingly, RBPFs combined with quadrature-based integration improve planning quality significantly compared to SIRPF-based planning under the same computational limits.
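The idea of Rao-Blackwellizing a belief update can be sketched as follows. This is a toy example under stated assumptions, not the paper's algorithm: each particle carries a sampled component `theta` plus a scalar Gaussian (mean and variance) over a component that is linear-Gaussian once `theta` is fixed. The model (random-walk dynamics, observation `y = cos(theta) * x + noise`) and all parameter names and values are hypothetical.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class RBParticle:
    theta: float   # sampled (non-analytic) state component
    mean: float    # Kalman mean of the marginalized component x
    var: float     # Kalman variance of the marginalized component x
    weight: float  # normalized importance weight


def rbpf_step(particles, y, rng, theta_std=0.05, q_var=0.1, r_var=0.5):
    """One RBPF update: sample theta, Kalman-update x, reweight by marginal likelihood."""
    updated = []
    for p in particles:
        # Sample the non-analytic component from its (assumed) random-walk dynamics.
        theta = p.theta + rng.gauss(0.0, theta_std)
        # Kalman predict for x under assumed random-walk dynamics x' = x + noise.
        m_pred, v_pred = p.mean, p.var + q_var
        # Kalman update for the assumed observation y = cos(theta) * x + noise.
        h = math.cos(theta)
        s = h * h * v_pred + r_var           # innovation variance
        k = v_pred * h / s                   # Kalman gain
        innov = y - h * m_pred
        mean = m_pred + k * innov
        var = (1.0 - k * h) * v_pred
        # The weight uses the marginal likelihood N(y; h * m_pred, s), with x integrated out.
        lik = math.exp(-0.5 * innov * innov / s) / math.sqrt(2.0 * math.pi * s)
        updated.append(RBParticle(theta, mean, var, p.weight * lik))
    total = sum(p.weight for p in updated)
    for p in updated:
        p.weight /= total
    return updated


if __name__ == "__main__":
    rng = random.Random(0)
    belief = [RBParticle(rng.gauss(0.0, 0.1), 0.0, 1.0, 1.0 / 50) for _ in range(50)]
    belief = rbpf_step(belief, y=0.8, rng=rng)
    print(sum(p.weight * p.mean for p in belief))  # weighted posterior mean of x
```

Because `x` is tracked in closed form, particles only need to cover the `theta` dimension, which is the intuition behind needing fewer particles than a filter that samples the full state.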

Paper Structure (17 sections, 13 equations, 4 figures, 1 table, 2 algorithms)

Introduction
Background
Partially Observable Markov Decision Process (POMDP)
Rao-Blackwellized Particle Filter (RBPF)
Technical Approach
Rao-Blackwell factorization of POMDPs
RBPF Belief Updates
RBPF in Sampling-Based Online Planners
Rao-Blackwellized POMCPOW (RB-POMCPOW)
Experiments
Localization Problem
Filter Performance Testing
Computational Costs and Cumulative Rewards
Discussion
Conclusion
...and 2 more sections

Figures (4)

Figure 1: POMCPOW (left) and RB-POMCPOW (right) Tree Structure Comparison. Each square and larger circle represents an action node and an observation node, respectively. In the POMCPOW tree, particles are shown as black dots while each particle in the RB-POMCPOW tree is associated with a Gaussian distribution. Due to the nature of POMCPOW, these particles form a weighted mixture of beliefs.
Figure 2: POMDP Localization Problem where the agent navigates toward the target. The agent receives noisy measurements from the static landmarks.
Figure 3: $N_{ESS}$ comparison between RBPF with 100 particles and SIRPF with 1000 particles, averaged over 100 simulations and smoothed using a moving average to reduce noise and visualize trends. The red dashed line represents the threshold that indicates our minimum desired efficiency for resampling.
Figure 4: Cumulative reward comparison between RBPF and SIRPF in RB-POMCPOW and POMCPOW. The graph shows cumulative rewards for different sparse grid levels $(q)$ in RB-POMCPOW (with a fixed 50 tree iterations) and varying tree iterations in POMCPOW using SIRPF and RBPF. A theoretical upper bound on the cumulative reward is computed by assuming an ideal scenario where the agent aligns its heading perfectly and moves directly to the target without encountering obstacles.
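The $N_{ESS}$ quantity plotted in Figure 3 is typically estimated from normalized particle weights as $1 / \sum_i w_i^2$. The sketch below assumes that common estimator; the captions do not state the exact formula or the resampling threshold used in the paper, so `threshold_fraction=0.5` is a hypothetical default.

```python
def effective_sample_size(weights):
    """Estimate the effective sample size as 1 / sum(w_i^2) over normalized weights."""
    total = sum(weights)
    if total <= 0.0:
        raise ValueError("weights must have a positive sum")
    normalized = [w / total for w in weights]
    return 1.0 / sum(w * w for w in normalized)


def needs_resampling(weights, threshold_fraction=0.5):
    """Return True when the ESS falls below an assumed fraction of the particle count."""
    return effective_sample_size(weights) < threshold_fraction * len(weights)


if __name__ == "__main__":
    uniform = [0.25, 0.25, 0.25, 0.25]
    degenerate = [0.97, 0.01, 0.01, 0.01]
    print(effective_sample_size(uniform), needs_resampling(uniform))
    print(effective_sample_size(degenerate), needs_resampling(degenerate))
```

Uniform weights give an ESS equal to the particle count, while a belief dominated by one particle gives an ESS close to 1, which is the particle deprivation symptom that the dashed threshold line in Figure 3 is meant to flag.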
