Formal Analysis of Metastable Failures in Software Systems
Peter Alvaro, Rebecca Isaacs, Rupak Majumdar, Kiran-Kumar Muniswamy-Reddy, Mahmoud Salamati, Sadegh Soudjani
TL;DR
This work provides a formal framework for metastable failures in retry-driven request-response systems by modeling such systems as continuous-time Markov chains (CTMCs) derived from a domain-specific language (DSL), with data-driven calibration to bridge the gap between the abstraction and real behavior. It introduces qualitative visualizations that reveal metastable regions and quantitative tools (expected hitting times and eigenvalue-based analysis) to predict recovery times and assess recovery strategies like throttling. The authors implement an open-source tool and validate it on scenarios inspired by real hyperscaler workloads, showing that metastable regimes cause long recovery times and that calibrated CTMCs can predict these effects with practical accuracy and speed. The study advances both theoretical understanding and practical tooling for predicting and mitigating metastable outages in cloud services.
Abstract
Many large-scale software systems exhibit metastable failures. In this class of failures, a stressor such as a temporary spike in workload causes system performance to drop, and performance then remains low even after the stressor is removed. These failures have been reported by many large corporations and are considered a rare but catastrophic source of availability outages in cloud systems. In this paper, we provide the mathematical foundations of metastability in request-response server systems. We model such systems using a domain-specific language. We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs through modeling and data-driven calibration. We use the structure of the CTMC models to provide a visualization of the qualitative behavior of the model. The visualization is a surprisingly effective way to identify system parameterizations that cause a system to show metastable behaviors. We complement the qualitative analysis with quantitative predictions. We provide a formal notion of metastable behaviors based on escape probabilities, and show that metastable behaviors are related to the eigenvalue structure of the CTMC. Our characterization leads to algorithmic tools to predict recovery times in metastable models of server systems. We have implemented our technique in a tool for the modeling and analysis of server systems. Through models inspired by failures in real request-response systems, we show that our qualitative visual analysis, which runs in a matter of milliseconds, captures and predicts many instances of metastability that were observed in the field. Our algorithms confirm that recovery times surge as the system parameters approach metastable modes in the dynamics.
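To make the quantitative side of the abstract concrete, the following is a minimal sketch of the two quantities it mentions: expected hitting times and the eigenvalue structure of a CTMC. It is not the authors' tool or DSL. The toy model is an assumption made for illustration only: a single queue with capacity `K`, arrival rate `lam`, service rate `mu`, and a hypothetical `retry_factor` that inflates arrivals once the queue length reaches `threshold`. All function and parameter names are invented for this example.

```python
import numpy as np


def build_generator(K, lam, mu, retry_factor, threshold):
    """Generator matrix Q of a toy birth-death CTMC over queue lengths 0..K."""
    Q = np.zeros((K + 1, K + 1))
    for n in range(K + 1):
        # Assumption: once the queue reaches `threshold`, requests time out and
        # clients retry, which inflates the effective arrival rate.
        arrival = lam * (1.0 + retry_factor) if n >= threshold else lam
        if n < K:
            Q[n, n + 1] = arrival  # a request joins the queue
        if n > 0:
            Q[n, n - 1] = mu       # a request completes
        Q[n, n] = -Q[n].sum()      # each row of a generator sums to zero
    return Q


def expected_hitting_times(Q, targets):
    """Expected time to reach any state in `targets` from every state.

    Solves Q_rr h_r = -1 on the non-target states r, with h = 0 on the targets.
    """
    target_set = set(targets)
    rest = [i for i in range(Q.shape[0]) if i not in target_set]
    h = np.zeros(Q.shape[0])
    h[rest] = np.linalg.solve(Q[np.ix_(rest, rest)], -np.ones(len(rest)))
    return h


def spectral_gap(Q):
    """Magnitude of the real part of the nonzero eigenvalue closest to zero."""
    magnitudes = np.sort(np.abs(np.linalg.eigvals(Q).real))
    return magnitudes[1]  # magnitudes[0] is the stationary eigenvalue (about 0)


if __name__ == "__main__":
    K, lam, mu, threshold = 20, 0.5, 1.0, 10
    for retry_factor in (0.0, 3.0):
        Q = build_generator(K, lam, mu, retry_factor, threshold)
        recovery = expected_hitting_times(Q, targets=[0])[K]
        print(f"retry_factor={retry_factor}: "
              f"expected time from full queue to empty = {recovery:.1f}, "
              f"spectral gap = {spectral_gap(Q):.5f}")
```

In this toy chain, enabling retries creates two regions the system tends to stay in (a nearly empty queue and a nearly full one). The expected time to drain a full queue grows sharply and the spectral gap shrinks, which is the kind of relationship between recovery time and eigenvalue structure that the abstract describes.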
