
Mutiny! How does Kubernetes fail, and what can we do about it?

Marco Barletta, Marcello Cinque, Catello Di Martino, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer

TL;DR

This paper addresses Kubernetes failure modes by combining real-world incident analysis with a fault injection framework (Mutiny) that perturbs the cluster data store (Etcd) to study failure propagation and resiliency. It develops a Field Failure Data Analysis (FFDA) to classify orchestrator-level faults from online reports and demonstrates that state alterations can trigger system-wide outages, provisioning imbalances, and network issues, with dependency-field errors being a major contributor. The Mutiny framework conducts thousands of injections and reveals that a majority of real-world failures can be reproduced in a controlled setting, providing quantitative insight into which fields and relationships are most fragile and how failures propagate to clients. The authors argue for integrating systematic, data-store–level resiliency testing into the software development lifecycle, outline concrete mitigations (validation, logging, rollbacks, and resource quotas), and propose extending this approach to other orchestration systems to improve dependability for critical cloud-native workloads.

Abstract

In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of studies on systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the stored data can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/overprovisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
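To make the "single bit-flip" fault model concrete, the following is a minimal sketch of how one inverted bit in a stored value can change a replica count. It is not the Mutiny framework: the `flip_bit` helper, the JSON encoding, and the sample `ReplicaSet` object are all assumptions chosen for readability, and a real cluster's stored state may be serialized differently.

```python
import json


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted (hypothetical helper)."""
    if not 0 <= bit_index < len(data) * 8:
        raise ValueError("bit_index out of range")
    corrupted = bytearray(data)
    # Select the byte holding the target bit, then toggle that bit with XOR.
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


# Illustrative stored object (an assumption, not taken from the paper).
original = json.dumps({"kind": "ReplicaSet", "spec": {"replicas": 3}}).encode()

# Locate the byte encoding the digit "3" and flip bit 2 of that byte:
# ASCII "3" is 0x33, and 0x33 XOR 0x04 is 0x37, which is ASCII "7".
offset = original.index(b"3")
corrupted = flip_bit(original, offset * 8 + 2)

print(json.loads(original)["spec"]["replicas"])   # value before the injection
print(json.loads(corrupted)["spec"]["replicas"])  # value after the injection
```

In this toy case the corrupted value still parses as a valid object, so nothing rejects it, and a controller reading it would try to run more replicas than intended. That is one way a single-bit error could surface as overprovisioning.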

Paper Structure (30 sections, 7 figures, 7 tables)


Figures (7)

  • Figure 1: Architecture of K8s.
  • Figure 2: Example of a cluster outage (Out) failure. A timeout during the control plane startup caused an intermittent Apiserver downtime. This caused Kubelets to be unable to report Node health, leading to a massive Node deletion and recreation by the Google Kubernetes Engine (GKE) autoscaler.
  • Figure 3: Fault injection framework.
  • Figure 4: Experimental workflow.
  • Figure 5: On the left, a golden run time series ($z\_score=-0.2$). On the right, an injection time series ($z\_score=11.0$).
  • ...and 2 more figures