Data-Efficient Policy Selection for Navigation in Partial Maps via Subgoal-Based Abstraction
Abhishek Paudel, Gregory J. Stein
TL;DR
This work addresses fast, reliable policy selection for goal-directed navigation in partially mapped environments by combining offline alt-policy replay with Learning over Subgoals Planning (LSP). The method computes lower bounds on how alternative policies would have performed using data collected during deployment and uses these bounds in a constrained UCB bandit to accelerate convergence and reduce cumulative regret. Experiments in simulated maze and office-like environments show substantial improvements (67% to 96% reductions in cumulative regret) over a baseline bandit approach, even with limited prior knowledge about unseen spaces. The approach leverages LSP's subgoal-based, pose-robust planning to enable reliable offline replay and practical data efficiency for deployment-time policy selection.
Abstract
We present a novel approach for fast and reliable policy selection for navigation in partial maps. Leveraging the recent learning-augmented, model-based Learning over Subgoals Planning (LSP) abstraction to plan, our robot reuses data collected during navigation to evaluate how well alternative policies could have performed, via a procedure we call offline alt-policy replay. Costs from offline alt-policy replay constrain policy selection among the LSP-based policies during deployment, allowing for improvements in convergence speed, cumulative regret, and average navigation cost. With only limited prior knowledge about the nature of unseen environments, we achieve at least 67% and as much as 96% improvement in cumulative regret over the baseline bandit approach in our experiments in simulated maze and office-like environments.
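To make the selection step concrete, the following is a minimal Python sketch of how replayed costs could constrain a UCB-style bandit that minimizes navigation cost. It is an illustration under stated assumptions, not the authors' implementation: the function name `select_policy`, the exploration weight `c`, and the rule of taking the larger of the optimistic UCB estimate and the replayed lower bound are all assumed here. Each entry of `lower_bounds` is assumed to be the average replayed lower bound on that policy's cost, or negative infinity when no replay data exists.

```python
import math


def select_policy(mean_costs, pull_counts, lower_bounds, trial, c=1.0):
    """Return the index of the policy with the smallest constrained optimistic cost.

    mean_costs[i]   : average observed cost of policy i over its deployments
    pull_counts[i]  : number of trials in which policy i was deployed
    lower_bounds[i] : assumed average lower bound on policy i's cost obtained
                      from offline replay (-inf if no replay data)
    trial           : 1-based index of the upcoming trial
    c               : exploration weight (hypothetical parameter)
    """
    best_idx, best_score = None, math.inf
    for i, (mean, n, lb) in enumerate(zip(mean_costs, pull_counts, lower_bounds)):
        if n == 0:
            # Never deployed: a plain UCB bandit is maximally optimistic here.
            optimistic = -math.inf
        else:
            # Optimistic (low) cost estimate, since we minimize cost.
            optimistic = mean - c * math.sqrt(math.log(trial) / n)
        # Replay constraint: the estimate may not fall below the replayed bound.
        score = max(optimistic, lb)
        if score < best_score:
            best_idx, best_score = i, score
    return best_idx


# Policy 0 was deployed four times; policies 1 and 2 were only replayed.
means = [100.0, 0.0, 0.0]
counts = [4, 0, 0]
replayed = [0.0, 150.0, 60.0]
chosen = select_policy(means, counts, replayed, trial=5, c=10.0)
unconstrained = select_policy(means, counts, [-math.inf] * 3, trial=5, c=10.0)
```

In this toy example the replayed bound of 150 for policy 1 already exceeds the optimistic estimate for policy 0, so the constrained rule skips policy 1 without deploying it, whereas the unconstrained rule would spend a trial on it first. This is the sense in which replayed costs can reduce exploration and cumulative regret.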
