WANDER: An Explainable Decision-Support Framework for HPC
Ankur Lahiry, Banooqa Banday, Yugesh Bhattarai, Tanzima Z. Islam
TL;DR
WANDER addresses the challenge of configuring heterogeneous HPC systems by unifying predictive modeling, counterfactual reasoning, and explainability within a query-driven decision-support framework. It fixes a predictive model $f: \mathcal{X} \rightarrow \mathcal{Y}$ and generates counterfactuals by optimizing a loss $cf\_loss$ that balances target validity, input proximity, and diversity, so that the suggested configurations are both realistic and diverse. The system provides interpretable explanations, uncertainty diagnostics, and causally informed trust checks to support decisions across prescriptive, exploratory, and counterfactual queries. Empirical results across multiple HPC datasets show that WANDER produces actionable, trustworthy configuration alternatives with quantified uncertainty and explicit causal reasoning, enabling safer and more efficient resource tuning in heterogeneous environments.
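To make the three terms of the counterfactual loss concrete, the following is a minimal sketch of one way such a loss could be composed. It is not WANDER's implementation: the function signature, the L1 distance, the squared-error validity term, and the weights `lam_prox` and `lam_div` are all assumptions chosen for illustration, and `predict` stands in for the fixed model $f$.

```python
import numpy as np

def cf_loss(candidates, x_orig, predict, y_target, lam_prox=0.5, lam_div=0.1):
    """Illustrative composite loss over a set of k candidate counterfactuals.

    All names and weights here are hypothetical; the source only states that
    the loss balances target validity, input proximity, and diversity.
    """
    candidates = np.asarray(candidates, dtype=float)  # shape (k, d)
    x_orig = np.asarray(x_orig, dtype=float)          # shape (d,)

    # Validity: mean squared gap between predicted and desired outcome.
    validity = np.mean((predict(candidates) - y_target) ** 2)

    # Proximity: mean L1 distance from the original configuration.
    proximity = np.mean(np.abs(candidates - x_orig).sum(axis=1))

    # Diversity: mean pairwise L1 distance among distinct candidates.
    k = len(candidates)
    if k > 1:
        pairwise = np.abs(candidates[:, None, :] - candidates[None, :, :]).sum(axis=2)
        diversity = pairwise.sum() / (k * (k - 1))
    else:
        diversity = 0.0

    # Lower is better: reach the target, stay close, and spread out.
    return validity + lam_prox * proximity - lam_div * diversity

# Toy stand-in for the fixed predictive model f (an assumption for this sketch).
def toy_model(X):
    return np.asarray(X, dtype=float).sum(axis=1)

loss = cf_loss([[1.0, 1.0], [2.0, 1.0]], [1.0, 1.0], toy_model, y_target=2.0)
```

The diversity term enters with a negative sign because the loss is minimized, so a more spread-out candidate set lowers the loss; in practice the weights would need tuning so that diversity does not overwhelm validity.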
Abstract
High-performance computing (HPC) systems expose many interdependent configuration knobs that impact runtime, resource usage, power, and variability. Existing predictive tools model these outcomes but do not support structured exploration, explanation, or guided reconfiguration. We present WANDER, a decision-support framework that synthesizes alternative configurations using counterfactual analysis aligned with user goals and constraints. We introduce a composite trade-off score that ranks suggestions based on prediction uncertainty, the consistency of feature-target relationships with causal models, and the similarity of feature distributions to historical data. To our knowledge, WANDER is the first system to unify prediction, exploration, and explanation for HPC tuning under a common query interface. Across multiple datasets, WANDER generates interpretable, trustworthy, human-readable alternatives that guide users toward their performance objectives.
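The composite trade-off score can be pictured as a weighted combination of the three signals named above. The sketch below is an assumption-laden illustration, not the formula used by WANDER: the `Suggestion` fields, the normalization of every signal to [0, 1], the equal default weights, and the linear form are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    """A candidate configuration with three hypothetical trust signals in [0, 1]."""
    name: str
    uncertainty: float         # prediction uncertainty, lower is better
    causal_consistency: float  # agreement with a causal model, higher is better
    similarity: float          # closeness to historical feature distributions

def tradeoff_score(s, w_unc=1 / 3, w_causal=1 / 3, w_sim=1 / 3):
    # Uncertainty is inverted so that every term rewards a trustworthy suggestion.
    return (w_unc * (1.0 - s.uncertainty)
            + w_causal * s.causal_consistency
            + w_sim * s.similarity)

def rank(suggestions, **weights):
    # Highest composite score first.
    return sorted(suggestions, key=lambda s: tradeoff_score(s, **weights), reverse=True)

candidates = [
    Suggestion("fewer_threads", uncertainty=0.5, causal_consistency=0.5, similarity=0.5),
    Suggestion("larger_tile", uncertainty=0.1, causal_consistency=0.9, similarity=0.8),
]
ranked = rank(candidates)
```

With equal weights, `larger_tile` ranks first because it is better on all three signals; shifting weight toward one signal lets a user express which notion of trust matters most for a given query.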
