STRATUS: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds

Yinfang Chen, Jiaqi Pan, Jackson Clark, Yiming Su, Noah Zheutlin, Bhavya Bhavya, Rohan Arora, Yu Deng, Saurabh Jha, Tianyin Xu

Abstract

In cloud-scale systems, failures are the norm. A distributed computing cluster exhibits hundreds of machine failures and thousands of disk failures; software bugs and misconfigurations are reported to be more frequent. The demand for autonomous, AI-driven reliability engineering continues to grow, as existing human-in-the-loop practices can hardly keep up with the scale of modern clouds. This paper presents STRATUS, an LLM-based multi-agent system for realizing autonomous Site Reliability Engineering (SRE) of cloud services. STRATUS consists of multiple specialized agents (e.g., for failure detection, diagnosis, mitigation), organized in a state machine to assist system-level safety reasoning and enforcement. We formalize a key safety specification of agentic SRE systems like STRATUS, termed Transactional No-Regression (TNR), which enables safe exploration and iteration. We show that TNR can effectively improve autonomous failure mitigation. STRATUS significantly outperforms state-of-the-art SRE agents in terms of success rate of failure mitigation problems in AIOpsLab and ITBench (two SRE benchmark suites), by at least 1.5 times across various models. STRATUS shows a promising path toward practical deployment of agentic systems for cloud reliability.

Paper Structure

This paper contains 30 sections, 1 theorem, 8 figures, 10 tables.

Key Result

Lemma 3.1

Every state $s$ in externally visible state transitions (i.e., $s=s_0^e$, or $s$ is a state immediately following a read-only action by $\alpha_D$ or $\alpha_G$, or $s$ is a state following the completion (commit or abort) of a transaction by $\alpha_M$ or $\alpha_U$) satisfies $\mu(s)\le b$.
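
The invariant in Lemma 3.1 can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the paper's implementation: the state is a dictionary of per-service alert counts, `mu` is a hypothetical severity metric, the bound `b` is set to the severity of the initial state, and the abort path simply discards a sandboxed copy (the paper instead describes a reconciliation-based undo using an action stack). The point it shows is that intermediate states inside a transaction may exceed the bound, but every state returned to the caller after a commit or abort satisfies `mu(s) <= b`.

```python
from copy import deepcopy
from typing import Callable, Dict, List

# Hypothetical state model: service name -> number of active alerts.
State = Dict[str, int]
Action = Callable[[State], None]


def mu(state: State) -> int:
    """Hypothetical severity metric: total number of active alerts."""
    return sum(state.values())


def run_transaction(state: State, actions: List[Action], bound: int) -> State:
    """Apply actions to a private copy; commit only if severity stays within bound.

    On abort the original state is returned unchanged, so no state that
    exceeds the bound ever becomes externally visible.
    """
    candidate = deepcopy(state)
    for act in actions:
        act(candidate)  # intermediate states may temporarily exceed the bound
    if mu(candidate) <= bound:
        return candidate  # commit
    return state  # abort


# Two illustrative mitigation attempts (both hypothetical).
def break_cart(s: State) -> None:
    s["cart"] += 5  # a bad mitigation that makes things worse


def restart_frontend(s: State) -> None:
    s["frontend"] = 0  # a good mitigation that clears alerts


s0 = {"frontend": 2, "cart": 1}
b = mu(s0)  # bound taken from the initial state (an assumption)

s1 = run_transaction(s0, [break_cart], b)        # aborted
s2 = run_transaction(s1, [restart_frontend], b)  # committed
visible_states = [s0, s1, s2]
```

In this sketch the first attempt is rolled back and the second is committed, so the agent can retry freely without leaving the system worse than it started.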

Figures (8)

  • Figure 1: Overview of Stratus, an LLM-based multi-agent system for autonomous Site Reliability Engineering (SRE) of modern cloud services.
  • Figure 2: The state machine based control-flow logic.
  • Figure 3: An example of the action stack used for reconciliation-based undo.
  • Figure 4: An example problem.
  • Figure 5: Probability density of the retry times per problem.
  • ...and 3 more figures

Theorems & Definitions (2)

  • Lemma 3.1
  • Proof: Proof Sketch.