
ChaosEater: Fully Automating Chaos Engineering with Large Language Models

Daisuke Kikuta, Hiroki Ikeuchi, Kengo Tajiri

TL;DR

ChaosEater removes the manual bottlenecks of Chaos Engineering (CE) by automating the full CE cycle with LLM agents that follow a predefined, systematic workflow for Kubernetes systems. Because Kubernetes systems are managed as Infrastructure as Code, it treats CE operations as software engineering tasks: defining hypotheses, planning and executing Chaos Mesh experiments, and iteratively reconfiguring deployments. Hypotheses are tested through Validation as Code, and the reported time and monetary costs are low. Case studies on Nginx and SockShop show that it stably completes single CE cycles, with reconfigured deployments that improve resilience and redundancy; human engineers and LLMs qualitatively validate the cycles. The work lays groundwork for end-to-end autonomous resiliency engineering in cloud-native environments and points to future directions such as production deployment, broader LLM support, and richer evaluation frameworks.

Abstract

Chaos Engineering (CE) is an engineering technique aimed at improving the resiliency of distributed systems. It involves artificially injecting specific failures into a distributed system and observing its behavior in response. Based on these observations, the system can be proactively improved to handle those failures. Recent CE tools automate the execution of predefined CE experiments. However, defining these experiments and improving the system based on the experimental results remain manual. To reduce the cost of these manual operations, we propose ChaosEater, a system that automates all CE operations with Large Language Models (LLMs). It predefines an agentic workflow according to a systematic CE cycle and assigns the subdivided operations within the workflow to LLMs. ChaosEater targets CE for Kubernetes systems, which are managed through code (i.e., Infrastructure as Code). Therefore, the LLMs in ChaosEater perform software engineering tasks to complete CE cycles, including requirement definition, code generation, debugging, and testing. We evaluate ChaosEater through case studies on both small and large Kubernetes systems. The results demonstrate that it stably completes reasonable single CE cycles at low time and monetary cost. The CE cycles are also qualitatively validated by human engineers and LLMs.
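
To make the idea of expressing both fault injection and hypothesis checks as code more concrete, the following is a minimal sketch, not ChaosEater's actual output. It builds a Chaos Mesh `PodChaos` manifest as a plain dictionary and pairs it with a small steady-state check in the spirit of Validation as Code. The function names, the `app` label, the experiment name, and the replica-count threshold are all assumptions made for illustration.

```python
import json


def build_pod_kill_manifest(name, namespace, app_label):
    """Return a Chaos Mesh PodChaos manifest (as a dict) that kills one matching pod.

    The helper itself is hypothetical; the manifest uses the standard
    PodChaos fields (action, mode, selector).
    """
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",
            "mode": "one",  # target a single pod among those selected
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": {"app": app_label},
            },
        },
    }


def steady_state_holds(ready_replica_samples, min_ready=1):
    """Hypothetical validation check: pass only if every sample meets the threshold.

    `ready_replica_samples` stands in for ready-replica counts observed
    during the experiment; an empty observation window counts as a failure.
    """
    if not ready_replica_samples:
        return False
    return all(n >= min_ready for n in ready_replica_samples)


if __name__ == "__main__":
    manifest = build_pod_kill_manifest("nginx-pod-kill", "default", "nginx")
    print(json.dumps(manifest, indent=2))
    # A single-replica deployment dips to zero ready pods after the kill.
    print(steady_state_holds([1, 0, 1]))
    # A three-replica deployment keeps at least one pod ready throughout.
    print(steady_state_holds([3, 2, 3]))
```

In a real setup the manifest would be serialized to YAML and applied to the cluster, and the samples would come from querying the cluster during the experiment; both steps are omitted here to keep the sketch self-contained.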

Paper Structure

This paper contains 36 sections, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: A simplified agentic workflow of ChaosEater. ChaosEater follows the workflow to autonomously complete the systematic CE cycle using LLM agents and existing tools. Note that only the representative inputs and outputs of agents are illustrated here. The two K8s clusters within the workflow refer to the same one.
  • Figure 2: The highlighted outputs for Nginx and SockShop. See Appendix \ref{adx:nginx} and \ref{adx:sockshop} for their full versions.
  • Figure 3: Qualitative evaluation results of CE cycles for each system. A score of 3 or higher is a positive rating.
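
The workflow described in the Figure 1 caption can be pictured as a loop in which a hypothesis is defined, an experiment is run against it, and the system is reconfigured until the hypothesis holds. The sketch below is a simplified illustration under stated assumptions, not the authors' implementation: the phase names, the `CycleState` structure, and the stub functions standing in for LLM agents and cluster tooling are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CycleState:
    """Hypothetical state carried through one CE cycle."""
    manifests: Dict[str, int]  # deployment name -> replica count (stand-in for K8s manifests)
    log: List[str] = field(default_factory=list)


def run_ce_cycle(state, hypothesize, experiment, improve, max_rounds=3):
    """Run hypothesis -> experiment -> improvement until the hypothesis holds.

    Each callable stands in for an agent or tool; returns True if the
    hypothesis is satisfied within `max_rounds` experiment runs.
    """
    hypothesis = hypothesize(state)
    state.log.append(f"hypothesis: {hypothesis}")
    for round_no in range(1, max_rounds + 1):
        passed = experiment(state, hypothesis)
        state.log.append(f"round {round_no}: {'pass' if passed else 'fail'}")
        if passed:
            return True
        improve(state)  # reconfigure the system, then re-run the experiment
    return False


# Stubs standing in for LLM agents and cluster tooling (assumptions).
def stub_hypothesize(state):
    return "at least one nginx replica stays ready"


def stub_experiment(state, hypothesis):
    # Simulate killing one pod: the hypothesis holds if a replica survives.
    return state.manifests["nginx"] - 1 >= 1


def stub_improve(state):
    # Simulate a reconfiguration that adds redundancy.
    state.manifests["nginx"] += 1


if __name__ == "__main__":
    state = CycleState(manifests={"nginx": 1})
    ok = run_ce_cycle(state, stub_hypothesize, stub_experiment, stub_improve)
    print(ok)
    print(state.manifests)
    print("\n".join(state.log))
```

Passing the phases in as callables keeps the control flow fixed while allowing each step to be swapped for an LLM-backed agent or a real cluster call, which mirrors the idea of a predefined workflow with subdivided operations.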