Enhancing Kubernetes Resilience through Anomaly Detection and Prediction

V. Anemogiannis, B. Andreou, K. Myrtollari, K. Panagidi, S. Hadjiefthymiades

TL;DR

The paper tackles the challenge of monitoring Kubernetes clusters by introducing a graph-based anomaly detection framework that captures inter-component dependencies. It combines unsupervised learning to learn a dynamic normal state with supervised models that predict anomaly probabilities, all within a real-time Neo4j graph representation of the cluster. The approach enables end-to-end tracing from Pods and Nodes to higher-level resources, and supports flexible model combinations via a modular toolbox. Experiments on a live EO4EU deployment show effective identification of problematic Nodes, Pods, Namespaces, and Deployments, highlighting practical utility for proactive cluster management.
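As a rough illustration of the end-to-end tracing idea, the sketch below walks from a Pod up to the resources it belongs to. It uses a plain in-memory dictionary rather than Neo4j, and every resource name, edge, and the `trace_upward` helper are hypothetical; it is not the paper's implementation.

```python
from collections import deque

# Hypothetical edges (resource -> resources it belongs to or runs on).
# Labels and names are illustrative only, not taken from the paper.
EDGES = {
    ("Container", "app"): [("Pod", "web-1")],
    ("Pod", "web-1"): [("ReplicaSet", "web-rs"), ("Node", "worker-a")],
    ("ReplicaSet", "web-rs"): [("Deployment", "web")],
    ("Deployment", "web"): [("Namespace", "demo")],
}


def trace_upward(start, edges):
    """Breadth-first walk from a resource to every resource reachable above it."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


if __name__ == "__main__":
    for kind, name in trace_upward(("Pod", "web-1"), EDGES):
        print(f"{kind}: {name}")
```

In a graph database the same traversal would typically be expressed as a variable-length path query; the dictionary here only stands in for that structure.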

Abstract

In recent years, Kubernetes has become widely used for the deployment and management of software projects on cloud infrastructure. Because these applications run across numerous Nodes, each with its own specifications, it has become a challenge to identify problems and ensure the smooth operation of the application. Effective supervision of the cluster remains a challenging and resource-intensive task. This work provides system maintainers with a novel framework that gives an overview of all the resources in Kubernetes and directs attention to specific parts of the cluster that may be exhibiting problematic behavior. The novelty of this component arises from the use of a graph representation of the cluster, in which features such as graph edges and neighboring nodes are used for anomaly detection. The proposed framework defines normality in the dynamic environment of Kubernetes, and its output feeds supervised models for abnormality detection, with results presented in a user-friendly graph interface. A variety of model combinations are evaluated and tested in a real-life environment.

Paper Structure

This paper contains 35 sections, 2 equations, 21 figures, 2 tables.

Figures (21)

  • Figure 1: Grafana's Graph Visualization of Monitoring Data
  • Figure 2: Kubernetes Resource Model
  • Figure 3: Using Neo4j to Represent a Workflow in the Kubernetes Cluster. The red node at the top symbolizes the Node the Workflow runs on. The pink node in the middle represents the Namespace created for this Workflow. The orange nodes represent Deployments, while the pink ones represent Replica Sets. The big light blue nodes represent Pods, while the green nodes represent the Containers. There are also other types of nodes inside the graph, such as Services, Stateful Sets, Ports, Labels, etc.
  • Figure 4: Anomaly Detection and Prediction Component Architecture
  • Figure 5: Position of the Anomaly Detection and Prediction Component in the Architecture
  • ...and 16 more figures