Formal Analysis of Metastable Failures in Software Systems

Peter Alvaro, Rebecca Isaacs, Rupak Majumdar, Kiran-Kumar Muniswamy-Reddy, Mahmoud Salamati, Sadegh Soudjani

TL;DR

This work provides a formal framework for metastable failures in retry-driven request-response systems by modeling such systems as CTMCs derived from a DSL, with data-driven calibration to bridge abstraction and real behavior. It introduces qualitative visualizations that reveal metastable regions and quantitative tools (expected hitting times and eigenvalue-based analysis) to predict recovery times and assess recovery strategies like throttling. The authors implement an open-source tool and validate it on scenarios inspired by real hyperscaler workloads, showing that metastable regimes cause long recovery times and that calibrated CTMCs can predict these effects with practical accuracy and speed. The study advances both theoretical understanding and practical tooling for predicting and mitigating metastable outages in cloud services.

Abstract

Many large-scale software systems demonstrate metastable failures. In this class of failures, a stressor such as a temporary spike in workload causes the system performance to drop and, subsequently, the system performance remains low even after the stressor is removed. These failures have been reported by many large corporations and are considered to be a rare but catastrophic source of availability outages in cloud systems. In this paper, we provide the mathematical foundations of metastability in request-response server systems. We model such systems using a domain-specific language. We show how to construct continuous-time Markov chains (CTMCs) that approximate the semantics of the programs through modeling and data-driven calibration. We use the structure of the CTMC models to provide a visualization of the qualitative behavior of the model. The visualization is a surprisingly effective way to identify system parameterizations that cause a system to show metastable behaviors. We complement the qualitative analysis with quantitative predictions. We provide a formal notion of metastable behaviors based on escape probabilities, and show that metastable behaviors are related to the eigenvalue structure of the CTMC. Our characterization leads to algorithmic tools to predict recovery times in metastable models of server systems. We have implemented our technique in a tool for the modeling and analysis of server systems. Through models inspired by failures in real request-response systems, we show that our qualitative visual analysis captures and predicts, in a matter of milliseconds, many instances of metastability that were observed in the field. Our algorithms confirm that recovery times surge as the system parameters approach metastable modes in the dynamics.
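To make the notion of an expected hitting time on a CTMC concrete, here is a minimal sketch in plain Python. It is not the authors' tool: the function names, the truncated M/M/1 model, and the rates are illustrative assumptions. For a generator matrix Q and a target set T, the expected hitting times h solve h(s) = 0 for s in T and sum_j Q[i][j] h(j) = -1 for every other state i.

```python
def mm1_generator(lam, mu, capacity):
    """Generator matrix of an M/M/1 queue truncated at `capacity` jobs (assumed model)."""
    n = capacity + 1
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i < capacity:
            Q[i][i + 1] = lam      # arrival
        if i > 0:
            Q[i][i - 1] = mu       # service completion
        Q[i][i] = -sum(Q[i])       # diagonal makes each row sum to zero
    return Q


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def expected_hitting_times(Q, target):
    """Expected time to reach any state in `target` from each state of the chain."""
    others = [s for s in range(len(Q)) if s not in target]
    A = [[Q[i][j] for j in others] for i in others]
    h = solve(A, [-1.0] * len(others))
    times = [0.0] * len(Q)
    for s, value in zip(others, h):
        times[s] = value
    return times


if __name__ == "__main__":
    # Expected time to drain the queue (reach state 0) from each queue length.
    for lam in (1.0, 1.5, 1.9):
        times = expected_hitting_times(mm1_generator(lam, 2.0, 20), {0})
        print(lam, round(times[-1], 2))
```

In this toy model the time to drain a full queue grows as the arrival rate approaches the service rate. A plain M/M/1 queue is not metastable, so this only illustrates the hitting-time computation, not the retry-driven dynamics studied in the paper.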

Paper Structure

This paper contains 48 sections, 2 theorems, 25 equations, 30 figures.

Key Result

Theorem 1

Let $\mathcal{M} = (S, Q)$ be an ergodic, finite CTMC. Let $D$ be a set of metastable points, $|D| = k$. Define $D_k = D$, and Then $-Q$ has $k$ eigenvalues $0 = \lambda_1 < \lambda_2 < \ldots < \lambda_k$, and
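As a minimal worked example (not taken from the paper) of how the eigenvalues of $-Q$ encode time scales, assume a two-state ergodic CTMC with hypothetical rates $a, b > 0$, moving from state 1 to state 2 at rate $a$ and back at rate $b$:

$$
-Q = \begin{pmatrix} a & -a \\ -b & b \end{pmatrix}, \qquad \det(-Q - \lambda I) = \lambda^2 - (a+b)\lambda = 0 \;\Longrightarrow\; \lambda_1 = 0,\ \lambda_2 = a + b.
$$

$$
\Pr[X_t = 1 \mid X_0 = 1] = \frac{b}{a+b} + \frac{a}{a+b}\, e^{-(a+b)t}.
$$

The zero eigenvalue corresponds to the stationary distribution, while the nonzero eigenvalue $\lambda_2$ sets the rate at which the chain forgets its initial state. A small nonzero eigenvalue therefore signals a slow time scale.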

Figures (30)

  • Figure 2: A simple example.
  • Figure 3: Metastability, pictorially.
  • Figure 6: Overall scheme for analyzing metastability in server systems.
  • Figure 7: Simulator implementation. The simulator gives an operational semantics to the DSL. We use Python's asyncio library. async denotes an asynchronous call (a future), await waits for an asynchronous call to finish. sleep blocks until some time has passed. tasks are run on a separate thread and do not block the main thread; wait_for waits for an asynchronous task to finish, shield ensures tasks are not cancelled. Internally, the async runtime maintains state in the form of requests, futures, and timers.
  • Figure 8: An M/M/1 queue in the DSL.
  • ...and 25 more figures
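The asyncio primitives named in the Figure 7 caption can be combined into a small timeout-and-retry client. The sketch below is an illustration only, not the paper's simulator: `serve`, `request_with_retries`, and all parameter values are hypothetical. It shows why `shield` matters: when the client times out, the server-side task keeps running, so a retry adds load rather than replacing it.

```python
import asyncio


async def serve(service_time: float) -> str:
    """Hypothetical server: handling one request takes `service_time` seconds."""
    await asyncio.sleep(service_time)
    return "ok"


async def request_with_retries(service_time: float, timeout: float, max_retries: int):
    """Send a request; on timeout, retry up to `max_retries` times.

    Returns (result, attempts); result is None if every attempt timed out.
    """
    attempts = 0
    for _ in range(max_retries + 1):
        attempts += 1
        task = asyncio.ensure_future(serve(service_time))
        try:
            # shield keeps the server-side task alive when the client gives up,
            # so the timed-out request still consumes server time.
            result = await asyncio.wait_for(asyncio.shield(task), timeout)
            return result, attempts
        except asyncio.TimeoutError:
            continue
    return None, attempts


if __name__ == "__main__":
    # A fast server answers on the first attempt; a slow one exhausts the retries.
    print(asyncio.run(request_with_retries(0.01, 1.0, 2)))
    print(asyncio.run(request_with_retries(1.0, 0.01, 2)))
```

Each timed-out attempt leaves a still-running `serve` task behind, which is the kind of wasted work that retry-driven overload builds on.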

Theorems & Definitions (5)

  • Remark 1
  • Remark 2: Finite state CTMCs
  • Definition 1: Metastability
  • Theorem 1
  • Lemma 1