Entropy-regularized Point-based Value Iteration

Harrison Delecki; Marcell Vazquez-Chanlatte; Esen Yel; Kyle Wray; Tomer Arnon; Stefan Witwicki; Mykel J. Kochenderfer

TL;DR

This work addresses the brittleness of model-based POMDP planners under both model uncertainty and goal uncertainty by introducing entropy-regularized planning. The proposed Entropy-regularized PBVI (ERPBVI) replaces the hard maximum in the PBVI backup with a softmax-style, entropy-regularized backup, implemented with per-action Q-functions and the $\operatorname{LogSumExp}$ operator, which yields more diverse policies that avoid overcommitting to a single action. In the Tiger, GridWorld, and Crosswalk experiments, ERPBVI is more robust to modeling errors and supports more accurate objective inference than PBVI, with the size of the benefit depending on the regularization strength $\lambda$. The approach offers a practical route to more reliable decision-making in uncertain, partially observable domains and suggests directions for future theoretical analysis and for decomposed POMDP fusion.
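As a rough illustration of the LogSumExp idea mentioned above, the following sketch compares a hard maximum over Q-values with a temperature-scaled LogSumExp and the matching softmax action distribution. It is not the paper's ERPBVI backup: the function names `soft_value` and `soft_policy`, the example Q-values, and the use of $\lambda$ as a temperature in the form $\lambda \log \sum_a \exp(Q_a / \lambda)$ are assumptions chosen for illustration.

```python
import math


def soft_value(q_values, lam):
    """Temperature-scaled LogSumExp: lam * log(sum(exp(q / lam))).

    Assumed form for illustration; it approaches max(q_values) as lam -> 0.
    """
    m = max(q_values)  # subtract the max for numerical stability
    return m + lam * math.log(sum(math.exp((q - m) / lam) for q in q_values))


def soft_policy(q_values, lam):
    """Softmax action distribution with temperature lam (assumed form)."""
    m = max(q_values)
    weights = [math.exp((q - m) / lam) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical Q-values for three actions at a single belief point.
q = [1.0, 0.9, -2.0]

for lam in (0.01, 0.1, 1.0):
    value = soft_value(q, lam)
    policy = [round(p, 3) for p in soft_policy(q, lam)]
    print(f"lambda={lam}: hard max={max(q)}, soft value={value:.4f}, policy={policy}")
```

With a small $\lambda$ the soft value is close to the hard maximum and nearly all probability goes to the best action. With a larger $\lambda$ the two near-tied actions share probability, which is the "less committed" behavior the summary describes.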

Abstract

Model-based planners for partially observable problems must accommodate both model uncertainty during planning and goal uncertainty during objective inference. However, model-based planners may be brittle under these types of uncertainty because they rely on an exact model and tend to commit to a single optimal behavior. Inspired by results in the model-free setting, we propose an entropy-regularized model-based planner for partially observable problems. Entropy regularization promotes policy robustness for planning and objective inference by encouraging policies to be no more committed to a single action than necessary. We evaluate the robustness and objective inference performance of entropy-regularized policies in three problem domains. Our results show that entropy-regularized policies outperform non-entropy-regularized baselines in terms of higher expected returns under modeling errors and higher accuracy during objective inference.

Paper Structure (25 sections, 21 equations, 7 figures, 1 table, 3 algorithms)

Introduction
Background
POMDPs
Belief-state Markov Decision Process
Point-based Value Iteration
Entropy-regularized PBVI
Entropy-regularized Planning Objective
ERPBVI Backup
Policy Extraction
Experiments
Robustness Experiments
Tiger
GridWorld
Objective Inference Experiments
GridWorld
...and 10 more sections

Figures (7)

Figure 1: Illustration of the PBVI backup (left) and the entropy-regularized PBVI backup (right). PBVI updates an estimate of the value function by taking the maximum over Q-values at next belief points. The entropy-regularized variant explicitly models Q-functions for all $n$ actions, and the maximum over Q-values is replaced by $\operatorname{LogSumExp}$ (LSE).
Figure 2: Illustration of the GridWorld POMDP (left). The agent tries to reach the goal in the lower-right corner while avoiding failure states. The expected visitation rate for the PBVI policy (middle) shows that the agent follows a single path to the goal through the narrow passage. The ERPBVI policy (right) follows multiple paths to the goal.
Figure 3: Illustration of the two GridWorld planning objectives used in the objective inference experiments. The states, actions, and observations are identical in each POMDP.
Figure 4: Robustness results on the Tiger POMDP using ERPBVI (solid line) and PBVI (dashed line) for various values of $p_{correct}$. The nominal $p_{correct}=0.85$ is shown in black. When the model is worse than expected ($p_{correct} \leq 0.85$), ERPBVI achieves higher expected return $\mathbb{E}[R]$ than PBVI for $\lambda \approx 1$.
Figure 5: Expected return for PBVI and ERPBVI policies in GridWorld with varying $p_{slip}$. All policies are trained with $p_{slip}=0$. Solid lines indicate ERPBVI performance, while dashed lines indicate PBVI. ERPBVI is less sensitive to changes in the POMDP's transition model when $\lambda \approx 0.02$.
...and 2 more figures
