LLM-Powered Fully Automated Chaos Engineering: Towards Enabling Anyone to Build Resilient Software Systems at Low Cost

Daisuke Kikuta, Hiroki Ikeuchi, Kengo Tajiri

TL;DR

ChaosEater addresses the high manual labor cost of Chaos Engineering by employing an LLM-driven agentic workflow to automate the entire CE cycle on Kubernetes. The approach partitions CE tasks among specialized LLM agents, leveraging Validation as Code and Chaos Mesh to plan, execute, analyze, and iteratively improve resilience with minimal human intervention. Empirical results from two Kubernetes case studies show low per-cycle costs (roughly $0.21–$0.84) and short run times (about 11–25 minutes), with qualitative validation from humans and multiple LLMs that the cycles are meaningful. The work demonstrates the feasibility of end-to-end CE automation and offers a foundation for broader, lower-cost resilience engineering in cloud-native environments.

Abstract

Chaos Engineering (CE) is an engineering technique aimed at improving the resilience of distributed systems. It involves intentionally injecting faults into a system to test its resilience, uncover weaknesses, and address them before they cause failures in production. Recent CE tools automate the execution of predefined CE experiments. However, planning such experiments and improving the system based on the experimental results remain manual. These processes are labor-intensive and require multi-domain expertise. To address these challenges and enable anyone to build resilient systems at low cost, this paper proposes ChaosEater, a system that automates the entire CE cycle with Large Language Models (LLMs). It predefines an agentic workflow according to a systematic CE cycle and assigns subdivided processes within the workflow to LLMs. ChaosEater targets CE for software systems built on Kubernetes. Therefore, the LLMs in ChaosEater complete CE cycles through software engineering tasks, including requirement definition, code generation, testing, and debugging. We evaluate ChaosEater through case studies on small- and large-scale Kubernetes systems. The results demonstrate that it consistently completes reasonable CE cycles at low time and monetary cost. Its cycles are also qualitatively validated by human engineers and LLMs.

Paper Structure

This paper contains 10 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: The agentic workflow of ChaosEater. It autonomously completes the systematic CE cycle using LLM agents and existing tools. Note that only the representative inputs and outputs of agents are illustrated here. The two K8s clusters within the workflow refer to the same one.
  • Figure 2: Examples of VaC scripts to validate steady states.
  • Figure 3: Qualitative evaluation results of CE cycles for each system. A score of 3 or higher is considered a positive rating.
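
The VaC scripts of Figure 2 are not reproduced in this summary. As a rough illustration of the idea behind Validation as Code, the sketch below expresses a steady state as a measurable property plus a pass threshold, then checks it against a series of observations. Everything here is an assumption made for illustration: the `SteadyState` and `validate` names, the 90% threshold, and the sample values are hypothetical and are not taken from ChaosEater. A real script would poll the Kubernetes cluster rather than use hard-coded samples.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class SteadyState:
    """A measurable property plus the threshold it must satisfy (hypothetical)."""
    name: str
    min_pass_ratio: float          # fraction of samples that must pass, in [0, 1]
    check: Callable[[int], bool]   # predicate applied to each sample


def validate(state: SteadyState, samples: Sequence[int]) -> bool:
    """Return True if enough samples satisfy the steady-state predicate."""
    if not samples:
        return False  # with no observations the state cannot be confirmed
    passed = sum(1 for s in samples if state.check(s))
    return passed / len(samples) >= state.min_pass_ratio


# Hypothetical steady state: at least 2 replicas are ready in 90% of samples.
replicas_ready = SteadyState(
    name="replicas-ready",
    min_pass_ratio=0.9,
    check=lambda ready: ready >= 2,
)

# Made-up observations of the ready-replica count, one value per poll.
baseline = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3]
during_fault = [3, 2, 1, 0, 0, 1, 2, 3, 3, 3]

if __name__ == "__main__":
    print("baseline ok:", validate(replicas_ready, baseline))
    print("during fault ok:", validate(replicas_ready, during_fault))
```

Writing the check as code means the same script can run before and during fault injection, so whether the hypothesis holds becomes a mechanical pass/fail result rather than a judgment call.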