Entropy-regularized Point-based Value Iteration
Harrison Delecki, Marcell Vazquez-Chanlatte, Esen Yel, Kyle Wray, Tomer Arnon, Stefan Witwicki, Mykel J. Kochenderfer
TL;DR
This work addresses robustness in model-based POMDP planning under both model uncertainty and goal uncertainty by introducing entropy-regularized planning. The proposed Entropy-regularized PBVI (ERPBVI) uses a softmax-style, entropy-regularized backup, implemented via per-action Q-functions and the LogSumExp operator, which yields more diverse policies that are less overcommitted to a single action. Across the Tiger, GridWorld, and Crosswalk experiments, ERPBVI is more robust to modeling errors and infers objectives more accurately than PBVI, with performance depending on the regularization strength $\lambda$. The approach provides a practical route to more reliable decision-making in uncertain, partially observable domains and suggests directions for theoretical analysis and for fusing decomposed POMDPs.
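To give a sense of what a LogSumExp-based, softmax-style backup does, the sketch below computes a soft maximum over a set of per-action Q-values at a single belief point, together with the matching softmax action distribution. It is a generic illustration of entropy regularization, not the ERPBVI implementation: the function names `soft_value` and `soft_policy`, the example Q-values, and the choice to treat $\lambda$ as a temperature that divides the Q-values are all assumptions made here for illustration.

```python
import math


def soft_value(q_values, lam):
    """Soft maximum: lam * log(sum_a exp(Q(a) / lam)).

    Assumption: lam > 0 acts as a temperature. As lam -> 0 this approaches
    max(q_values); larger lam adds a larger entropy bonus.
    """
    if lam <= 0:
        raise ValueError("lam must be positive")
    m = max(q_values)  # subtract the max for numerical stability
    return m + lam * math.log(sum(math.exp((q - m) / lam) for q in q_values))


def soft_policy(q_values, lam):
    """Softmax action distribution: pi(a) proportional to exp(Q(a) / lam)."""
    if lam <= 0:
        raise ValueError("lam must be positive")
    m = max(q_values)
    weights = [math.exp((q - m) / lam) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical Q-values for three actions at one belief point.
q = [1.0, 0.5, -2.0]
for lam in (0.01, 1.0, 10.0):
    policy = [round(p, 3) for p in soft_policy(q, lam)]
    print(f"lam={lam}: value={soft_value(q, lam):.3f}, policy={policy}")
```

Under these assumptions, a small $\lambda$ recovers nearly greedy behavior, while a large $\lambda$ spreads probability across actions. The soft value also equals the expected Q-value under the softmax policy plus $\lambda$ times that policy's entropy, which is the sense in which the backup is entropy-regularized.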
Abstract
Model-based planners for partially observable problems must accommodate both model uncertainty during planning and goal uncertainty during objective inference. However, model-based planners may be brittle under these types of uncertainty because they rely on an exact model and tend to commit to a single optimal behavior. Inspired by results in the model-free setting, we propose an entropy-regularized model-based planner for partially observable problems. Entropy regularization promotes policy robustness for planning and objective inference by encouraging policies to be no more committed to a single action than necessary. We evaluate the robustness and objective inference performance of entropy-regularized policies in three problem domains. Our results show that entropy-regularized policies outperform non-entropy-regularized baselines, achieving higher expected returns under modeling errors and higher accuracy during objective inference.
