MetaKube: An Experience-Aware LLM Framework for Kubernetes Failure Diagnosis

Wei Sun, Ting Wang, Xinran Tian, Wanshun Lan, Xuhan Feng, Haoyue Li, Fangxin Wang

Abstract

Existing LLM-based Kubernetes diagnostic systems cannot learn from operational experience: they operate on static knowledge bases and do not improve from past resolutions. We present MetaKube, an experience-aware LLM framework built on three synergistic innovations: (1) an Episodic Pattern Memory Network (EPMN) that abstracts diagnostic patterns from historical resolutions and provides confidence-calibrated retrieval for both rapid pattern matching and guided causal exploration, (2) a meta-cognitive controller that dynamically routes between intuitive and analytical pathways based on problem familiarity, optimizing the trade-off between speed and depth, and (3) KubeLLM, a locally deployable 8B model enhanced through domain-specific post-training on our 7,000-sample Kubernetes Fault Resolution Dataset. Evaluation on 1,873 real-world scenarios shows that MetaKube raises the score of Qwen3-8B from 50.9 to 90.5 points, approaching GPT-4.1 performance while ensuring complete data privacy. EPMN contributes a 15.3% improvement through experiential learning, and continuous learning experiments show progressive gains as the system accumulates operational knowledge. The source code and related resources are available at https://github.com/MetaKube-LLM-for-Kubernetes-Diagnosis/MetaKube.
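
To make the routing idea in the abstract concrete, here is a minimal sketch of confidence-based pathway selection. It is not the authors' implementation: the `Pattern` class, the `route` function, the 0.8 threshold, and the sample memory entries are all hypothetical and chosen only for illustration. The sketch assumes each retrieved pattern carries a calibrated confidence in [0, 1], and that a familiar problem (a high-confidence match) goes to the intuitive pathway while anything else falls back to the analytical pathway.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Pattern:
    """Hypothetical stand-in for a diagnostic pattern held in memory."""
    name: str
    diagnosis: str
    confidence: float  # assumed: calibrated retrieval confidence in [0, 1]


def route(patterns: Sequence[Pattern],
          threshold: float = 0.8) -> Tuple[str, Optional[Pattern]]:
    """Pick a pathway from the best retrieved pattern.

    The threshold value is an assumption, not a figure from the paper.
    """
    best = max(patterns, key=lambda p: p.confidence, default=None)
    if best is not None and best.confidence >= threshold:
        # Familiar problem: reuse the remembered resolution directly.
        return "intuitive", best
    # Unfamiliar problem: fall back to deeper causal analysis,
    # passing along the weak match (if any) as a hint.
    return "analytical", best


# Illustrative memory contents (invented for this example).
memory = [
    Pattern("CrashLoopBackOff-OOM", "raise the container memory limit", 0.92),
    Pattern("ImagePullBackOff", "check registry credentials", 0.35),
]

print(route(memory)[0])       # strong match available
print(route(memory[1:])[0])   # only a weak match available
print(route([])[0])           # empty memory
```

With these sample entries, the first call selects the intuitive pathway and the other two fall back to the analytical pathway. A single fixed threshold is the simplest possible design; a system of the kind the abstract describes would presumably derive the routing decision from richer familiarity signals.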

Paper Structure (42 sections, 28 equations, 4 figures, 3 tables, 1 algorithm)


Figures (4)

  • Figure 1: The MetaKube dual-pathway diagnostic architecture. The system integrates memory-augmented pattern recognition (EPMN) with causal reasoning (KubeGraph) through two processing pathways: an intuitive pathway for rapid pattern-based diagnosis and an analytical pathway for complex causal analysis. A meta-cognitive controller dynamically routes queries based on confidence assessment, optimizing the trade-off between diagnostic speed and depth.
  • Figure 2: Construction of the Kubernetes Fault Resolution Dataset. After collecting authentic Kubernetes troubleshooting cases, we generate additional synthetic examples, perform deduplication, and enhance the question-answer pairs with chain-of-thought reasoning using an LLM.
  • Figure 3: EPMN ablation study results on KubeFault.
  • Figure 4: Performance scores of different models on the KFRD test set (5-point scale).