On Sequential Fault-Intolerant Process Planning
Andrzej Kaczmarczyk, Davin Choo, Niclas Boehmer, Milind Tambe, Haifeng Xu
TL;DR
SFIPP addresses fault-intolerant sequential decision-making with unknown stage-wise success probabilities: a round succeeds only if all $m$ stages succeed, so the expected round reward is the product $\prod_{s=1}^m p_{s,i_s}$, where $i_s$ is the action chosen at stage $s$, and learning occurs over a horizon of $T$ rounds. The authors introduce a staged-bandit framework that yields provably tight regret bounds in both the deterministic and the probabilistic setting, and show that collapsing stages of the same type can further reduce regret. They propose and analyze specialized algorithms (UTF, SB, SCB, SCFGB) that exploit problem structure, and demonstrate experimentally that these tailored methods outperform a generic bandit approach, with collapsing offering clear gains when stage types are correctly identified. The work advances fault-intolerant multi-stage planning under uncertainty and has potential impact on domains such as drug discovery, security, and quality-critical design.
Abstract
We propose and study a planning problem we call Sequential Fault-Intolerant Process Planning (SFIPP). SFIPP captures a reward structure common in many sequential multi-stage decision problems where the planning is deemed successful only if all stages succeed. Such reward structures differ from classic additive reward structures and arise in important applications such as drug/material discovery, security, and quality-critical product design. We design provably tight online algorithms for settings in which we need to pick between different actions with unknown success chances at each stage. We do so both for the foundational case in which the behavior of actions is deterministic, and for the case of probabilistic action outcomes, where we balance exploration for learning and exploitation for planning by using multi-armed bandit algorithms. In our empirical evaluations, we demonstrate that the specialized algorithms we develop, which leverage additional information about the structure of the SFIPP instance, outperform our more general algorithm.
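To make the all-stages-must-succeed reward concrete, here is a minimal Python sketch. It is an illustration only, not one of the paper's algorithms: the function names, the example probabilities, and the assumption that stage outcomes are independent Bernoulli draws are all hypothetical. It shows that a round pays 1 only when every stage succeeds, that the expected reward is therefore the product of the chosen success probabilities, and that with known probabilities the best plan simply picks the most reliable action at each stage.

```python
import math
import random


def expected_round_reward(success_probs, choices):
    """Expected reward: product of the chosen actions' success probabilities."""
    return math.prod(success_probs[s][i] for s, i in enumerate(choices))


def simulate_round(success_probs, choices, rng):
    """Fault-intolerant reward: 1 if every stage succeeds, otherwise 0.

    Assumes each stage is an independent Bernoulli trial (illustrative assumption).
    """
    return int(all(rng.random() < success_probs[s][i]
                   for s, i in enumerate(choices)))


def best_plan(success_probs):
    """With known probabilities, the product is maximized stage by stage."""
    return [max(range(len(stage)), key=stage.__getitem__)
            for stage in success_probs]


# Hypothetical instance: 3 stages, 2 candidate actions per stage.
probs = [[0.9, 0.5], [0.8, 0.95], [0.7, 0.6]]
plan = best_plan(probs)
value = expected_round_reward(probs, plan)
outcome = simulate_round(probs, plan, random.Random(0))
```

In the online setting the probabilities are unknown, so a learner has to estimate them from observed successes and failures over the $T$ rounds; the sketch above only covers the known-probability benchmark against which regret would be measured.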
