
Data-Efficient Policy Selection for Navigation in Partial Maps via Subgoal-Based Abstraction

Abhishek Paudel, Gregory J. Stein

TL;DR

This work addresses fast, reliable policy selection for goal-directed navigation in partially mapped environments by combining offline alt-policy replay with Learning over Subgoals Planning (LSP). The method computes lower bounds on how alternative policies would have performed using data collected during deployment and uses these bounds in a constrained UCB bandit to accelerate convergence and reduce cumulative regret. Experiments in simulated maze and office-like environments show substantial improvements (67%–96% reductions in regret) over a baseline bandit approach, even with limited prior knowledge about unseen spaces. The approach leverages LSP's subgoal-based, pose-robust planning to enable reliable offline replay and practical data efficiency for deployment-time policy selection.

Abstract

We present a novel approach for fast and reliable policy selection for navigation in partial maps. Leveraging the recent learning-augmented model-based Learning over Subgoals Planning (LSP) abstraction to plan, our robot reuses data collected during navigation to evaluate how well other alternative policies could have performed via a procedure we call offline alt-policy replay. Costs from offline alt-policy replay constrain policy selection among the LSP-based policies during deployment, allowing for improvements in convergence speed, cumulative regret and average navigation cost. With only limited prior knowledge about the nature of unseen environments, we achieve at least 67% and as much as 96% improvements on cumulative regret over the baseline bandit approach in our experiments in simulated maze and office-like environments.
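
The selection step described above can be sketched as a cost-minimizing UCB rule whose optimistic estimate is clipped from below by the replay-derived bounds. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `select_policy`, the exploration weight `c`, and the choice to combine the two quantities with `max` are all hypothetical, and the per-policy statistics are assumed to be maintained elsewhere.

```python
import math


def select_policy(mean_costs, counts, mean_lower_bounds, trial, c=1.0):
    """Pick the policy with the smallest constrained optimistic cost.

    mean_costs[i]        : average observed cost of policy i (ignored if untried)
    counts[i]            : number of trials in which policy i was deployed
    mean_lower_bounds[i] : average lower bound on policy i's cost, assumed to
                           come from offline replay (hypothetical input)
    trial                : 1-based index of the upcoming trial
    c                    : exploration weight (assumed hyperparameter)
    """
    best_idx, best_val = None, math.inf
    for i, (mu, n, lb) in enumerate(zip(mean_costs, counts, mean_lower_bounds)):
        if n == 0:
            # An untried policy has an unbounded exploration bonus.
            optimistic = -math.inf
        else:
            # Standard UCB bonus, subtracted because we minimize cost.
            optimistic = mu - c * math.sqrt(math.log(trial) / n)
        # The replay bound says the policy cannot be cheaper than lb,
        # so the optimistic estimate is never allowed to drop below it.
        index = max(optimistic, lb)
        if index < best_val:
            best_idx, best_val = i, index
    return best_idx


# Policy 1 has never been deployed, but its replay bound (150) already
# exceeds policy 0's optimistic cost, so it is not selected.
choice = select_policy([100.0, 0.0], [5, 0], [80.0, 150.0], trial=5)
```

In this sketch an unconstrained UCB rule would be forced to try policy 1 at least once; the lower bound lets the selector skip it, which is the intuition behind the faster convergence reported in the abstract.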

Paper Structure

This paper contains 21 sections, 7 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Overview of our approach for data-efficient policy selection for navigation in partial maps. Our approach relies on offline alt-policy replay, a procedure that computes lower bounds on the navigation costs of alternative policies after deployment; these bounds are then used to constrain policy selection.
  • Figure 2: Lower Bound Cost Approximation: (a) Policy $\pi$ guides the robot in a trial. (b) During offline alt-policy replay, policy $\pi'$ attempts to leave known space via subgoal $s'$ to try to reach the goal with proposed path $u_a$. The minimum of such paths obtained during replay gives the optimistic lower bound of policy $\pi'$. (c) Under the simply-connected assumption, the lower bound is the net distance travelled under policy $\pi'$ during offline replay.
  • Figure 3: The Simulated Environments. Robot-view panoramic images from the simulation environments (top two rows) and sample maps from (a) maze and (b) office-like environments. All our experiments are conducted in simulated environments rendered using the Unity game engine.
  • Figure 4: Average Navigation Cost (mean) and Cumulative Regret (mean) for deployments in maze environments, Fig. 3(a). Each deployment consists of 100 randomized navigation trials, each in a previously unseen maze. Mean cost and regret are computed across 200 randomized deployments. For our approach Const-UCB, we show results with optimistic $C^{\text{lb,opt}}$ and simply connected $C^{\text{lb,s.c.}}$ lower bounds, as discussed in the paper's section on lower-bound approximation. The solid lines denote the mean, and the shaded regions show the 10th to 90th percentile. The triangle, diamond, and square symbols denote average cost (filled) and cumulative regret (unfilled) at the 10th, 40th, and 100th trial, respectively, in both the table and the plot for each environment.
  • Figure 5: Average Navigation Cost (mean) and Cumulative Regret (mean) for deployments in office-centric environments, Fig. 3(b). Each deployment consists of 100 randomized navigation trials, each in a previously unseen map. Mean cost and regret are computed across 200 randomized deployments. For our approach Const-UCB, we show results with optimistic $C^{\text{lb,opt}}$, weighted $C^{\text{lb,wgt}}$, and simply connected $C^{\text{lb,s.c.}}$ lower bounds, as discussed in the paper's section on lower-bound approximation. The solid lines denote the mean, and the shaded regions show the 10th to 90th percentile. The triangle, diamond, and square symbols denote average cost (filled) and cumulative regret (unfilled) at the 10th, 40th, and 100th trial, respectively, in both the table and the plot for each environment.
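
The curves in Figures 4 and 5 aggregate per-trial costs over many deployments. The sketch below shows one plausible way to compute such curves; it is an assumption rather than the paper's evaluation code. In particular, it treats "average navigation cost" as the running mean over trials so far, and it measures regret against a fixed reference cost `best_expected_cost` (a hypothetical stand-in for the expected cost of the best available policy). The function name `deployment_curves` is likewise illustrative.

```python
from itertools import accumulate


def deployment_curves(costs, best_expected_cost):
    """Return (mean running-average cost, mean cumulative regret) per trial.

    costs[d][k]        : navigation cost of trial k in deployment d
    best_expected_cost : reference cost used to define regret (assumed)
    """
    n_deployments = len(costs)
    n_trials = len(costs[0])

    # Running average of cost within each deployment.
    avg_rows = [
        [total / (k + 1) for k, total in enumerate(accumulate(row))]
        for row in costs
    ]
    # Cumulative regret within each deployment.
    regret_rows = [
        list(accumulate(c - best_expected_cost for c in row)) for row in costs
    ]

    # Average each curve across deployments, trial by trial.
    mean_avg_cost = [
        sum(r[k] for r in avg_rows) / n_deployments for k in range(n_trials)
    ]
    mean_regret = [
        sum(r[k] for r in regret_rows) / n_deployments for k in range(n_trials)
    ]
    return mean_avg_cost, mean_regret


# Two toy deployments of two trials each.
avg_cost, regret = deployment_curves([[10, 20], [30, 40]], best_expected_cost=10)
```

With 200 deployments of 100 trials each, as in the captions, `costs` would be a 200-by-100 nested list and the returned curves would each have 100 entries.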