Contingency Planning Using Bi-level Markov Decision Processes for Space Missions

Somrita Banerjee, Edward Balaban, Mark Shirley, Kevin Bradner, Marco Pavone

TL;DR

The paper addresses autonomous contingency planning for space science missions with large state and action spaces in rover traverse planning. It introduces a bi-level MDP framework that separates high-level target selection from low-level path planning, enabling rapid policy computation from off-nominal states. Through RoverGridWorld, the authors demonstrate substantial compute-time savings with near-optimal rewards and show that the advantage increases with problem complexity. This approach integrates well with mission-planning workflows, enhances explainability, and supports fast generation of contingency branches for robust autonomous exploration.

Abstract

This work focuses on autonomous contingency planning for scientific missions by enabling rapid policy computation from any off-nominal point in the state space in the event of a delay or deviation from the nominal mission plan. Successful contingency planning involves managing risks and rewards, often probabilistically associated with actions, in stochastic scenarios. Markov Decision Processes (MDPs) are used to mathematically model decision-making in such scenarios. However, in the specific case of planetary rover traverse planning, the vast action space and long planning time horizon pose computational challenges. A bi-level MDP framework is proposed to improve computational tractability, while also aligning with existing mission planning practices and enhancing explainability and trustworthiness of AI-driven solutions. We discuss the conversion of a mission planning MDP into a bi-level MDP, and test the framework on RoverGridWorld, a modified GridWorld environment for rover mission planning. We demonstrate the computational tractability and near-optimal policies achievable with the bi-level MDP approach, highlighting the trade-offs between compute time and policy optimality as the problem's complexity grows. This work facilitates more efficient and flexible contingency planning in the context of scientific missions.
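As a point of reference for the MDP machinery mentioned above, the following is a minimal sketch of tabular value iteration on a toy corridor problem. It is not the authors' RoverGridWorld or their solver: the `corridor_mdp` environment, the slip probability, the discount factor, and all function names are assumptions made purely for illustration.

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8, max_iters=10_000):
    """Tabular value iteration.

    P: array (A, S, S); P[a, s, s2] is the probability of s -> s2 under action a.
    R: array (A, S); expected immediate reward for taking action a in state s.
    Returns the value function V (S,) and a greedy policy (S,).
    """
    V = np.zeros(P.shape[1])
    for _ in range(max_iters):
        Q = R + gamma * (P @ V)          # Bellman backup, shape (A, S)
        V_new = Q.max(axis=0)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = R + gamma * (P @ V)
    return V, Q.argmax(axis=0)

def corridor_mdp(n=4, slip=0.1):
    """Hypothetical toy problem: a 1-D corridor whose last cell is an absorbing goal.

    Actions: 0 = left, 1 = right. With probability `slip` the agent stays put.
    Entering the goal cell yields a reward of 1.
    """
    P = np.zeros((2, n, n))
    R = np.zeros((2, n))
    goal = n - 1
    for s in range(n):
        if s == goal:
            P[:, s, s] = 1.0             # absorbing, no further reward
            continue
        for a, step in enumerate((-1, +1)):
            s2 = min(max(s + step, 0), n - 1)
            P[a, s, s2] += 1.0 - slip
            P[a, s, s] += slip
            if s2 == goal:
                R[a, s] = 1.0 - slip     # expected reward for reaching the goal
    return P, R

P, R = corridor_mdp()
V, policy = value_iteration(P, R)
print(policy)  # → [1 1 1 0]; the entry for the absorbing goal cell is arbitrary
```

Each sweep touches every state-action pair, so the cost grows with the size of the state and action spaces. This is the scaling pressure that motivates decomposing a large planning problem into smaller ones.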

Paper Structure

This paper contains 12 sections, 12 equations, 6 figures, and 1 algorithm.

Figures (6)

  • Figure 1: A rover path planning problem to collect measurements and drill at targets, while avoiding obstacles and moving sun shadows (not pictured), is solved using a flat MDP and a bi-level MDP. In the bi-level MDP formulation, the high-level MDP decides which scientific target to drill next, while the low-level MDP plans the path and decides placement of measurements. Each target can only be drilled once, and must be preceded by a measurement at a neighboring cell (including diagonal cells). In both formulations, the optimal path is the same. However, solving the bi-level MDP is faster.
  • Figure 2: A sample reward field for RoverGridWorld that includes targets as well as static and dynamic obstacles. The higher reward targets (dark green) correspond to science stations and the lower reward target (light green) corresponds to a hibernation area. Note that the reward for a hibernation area is only included for ease of visualization, and is not important to the problem. The static obstacles (dark red) are used to model hazardous terrain and the dynamic obstacles (light red) are used to model swiftly moving sun shadows, e.g., those near the lunar polar regions. By the end of the time horizon, if the rover is not at the hibernation area, it begins accruing negative rewards.
  • Figure 3: A RoverGridWorld problem is solved using value iteration applied to a flat MDP structure and a bi-level MDP structure. The flat MDP reasons at a granular level while the bi-level MDP consists of a high-level MDP that decides the order of visiting targets and a low-level MDP that handles path planning to the selected target. The paths chosen by the two frameworks are equivalent in terms of Manhattan distance. In both cases, the total reward accrued at the end of the traverse is the same. However, as discussed later, the bi-level MDP uses less computation time than the flat MDP, while achieving near-optimal policies.
  • Figure 4: For experiment 1, which compares four methods of solving the RoverGridWorld problem, the bi-level MDP solved with value iteration (VI) converges to the same maximum reward as the flat MDP solved with value iteration (which solves the problem optimally), but in about half the compute time ($\approx$0.5 vs. $\approx$1 second). The reinforcement-learning-based solvers, Q-learning and SARSA, take longer and achieve smaller rewards, which is expected since they are model-free methods. The shading denotes the error averaged over 500 simulations of each policy. The visual oscillations at the end of the VI lines arise from multiple runs with different maximum-iteration limits, all of which converge at the same iteration count with nearly identical computation times.
  • Figure 5: For experiment 2, the same four methods are compared on a larger RoverGridWorld problem. The bi-level MDP solved with value iteration converges to the optimal policy much faster ($\approx$20 seconds) than the flat MDP solved with value iteration ($>$100 seconds). The shading denotes the error averaged over 500 simulations of each policy. The visual oscillations at the end of the VI lines arise from multiple runs with different maximum-iteration limits, all of which converge at the same iteration count with similar computation times.
  • ...and 1 more figure
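The two-level split described in the captions for Figures 1 and 3 can be sketched in a heavily simplified, deterministic form. This is an illustration under stated assumptions, not the authors' algorithm: the low level is replaced by breadth-first shortest paths around static obstacles, and the high level is an exhaustive memoized search over (location, visited targets, remaining time). The measurement-before-drilling rule, moving shadows, the hibernation area, and all stochasticity are omitted. The names `bfs_steps` and `plan_targets`, the grid, and the reward values are hypothetical.

```python
from collections import deque
from functools import lru_cache

def bfs_steps(grid, start):
    """Low-level stand-in: fewest 4-connected steps from `start` to each free cell."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    queue = deque([start])
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and (nr, nc) not in dist:
                dist[(nr, nc)] = dist[(r, c)] + 1
                queue.append((nr, nc))
    return dist

def plan_targets(grid, start, targets, horizon):
    """High-level stand-in: choose which targets to visit, and in what order.

    targets: dict mapping a cell (row, col) to its reward; each is collected once.
    Returns (total_reward, visiting_order) achievable within `horizon` steps.
    """
    names = list(targets)
    steps = {src: bfs_steps(grid, src) for src in [start] + names}

    @lru_cache(maxsize=None)
    def best(loc, visited, time_left):
        best_val, best_order = 0.0, ()
        for i, t in enumerate(names):
            if visited & (1 << i):
                continue                      # target already collected
            cost = steps[loc].get(t)
            if cost is None or cost > time_left:
                continue                      # unreachable or too far
            val, order = best(t, visited | (1 << i), time_left - cost)
            val += targets[t]
            if val > best_val:
                best_val, best_order = val, (t,) + order
        return best_val, best_order

    return best(start, 0, horizon)

# Hypothetical 4x4 map: '#' marks static obstacles, '.' marks free cells.
grid = ["....",
        ".##.",
        "....",
        "...."]
targets = {(0, 3): 5, (3, 0): 3, (3, 3): 4}
reward, order = plan_targets(grid, (0, 0), targets, horizon=7)
print(reward, order)  # → 9.0 ((0, 3), (3, 3))
```

In this sketch the high level reasons only over target choices, while cell-by-cell movement is delegated to the low level, so the high-level search space depends on the number of targets and not on the number of grid cells.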