Migrating Existing Container Workload to Kubernetes -- LLM Based Approach and Evaluation

Masaru Ueno, Tetsuya Uchiumi

TL;DR

This paper tackles the challenge of migrating Compose-based workloads to Kubernetes by evaluating LLM-driven manifest synthesis against a purpose-built microbenchmark. It introduces three quality criteria (correctness, context-groundedness, and consistency) and assesses prompts, model variants, and JSON-mode outputs across 50-sample setups. Key findings show that while LLMs can produce accurate manifests for standard cases, they may omit readability comments and struggle with atypical inputs; structured JSON outputs and expert prompting improve stability, yet non-determinism still requires human QA and validation. The work provides an open benchmark and practical guidance for automating Kubernetes manifest generation from Compose inputs, with implications for DevOps tooling and future research on prompt tuning and postprocessing pipelines.

Abstract

Although Kubernetes has become a widespread open-source system that automates the management of containerized applications, its complexity can be a significant barrier, particularly for application developers unfamiliar with it. One approach employs large language models (LLMs) to assist developers in generating Kubernetes manifests; however, it is currently impossible to determine whether the output satisfies the given specifications and is comprehensible. In this study, we propose a benchmarking method for evaluating the effectiveness of LLMs in synthesizing manifests, using the Compose specification, a standard widely adopted by application developers, as input. The proposed benchmarking method revealed that LLMs generally produce accurate results that compensate for simple specification gaps. However, we also observed that inline comments for readability were often omitted, and that completion accuracy was low for atypical inputs with unclear intentions.

Paper Structure (30 sections, 6 figures)

Figures (6)

  • Figure 1: (a) Example of Kubernetes manifests. (b) Example of Compose specifications, including Kompose-specific labels for controlling output manifests.
  • Figure 2: Prompt for evaluation of context-groundedness.
  • Figure 3: Prompts used to investigate impacts on output: (a) zero-shot prompts (b) role prompts extended by ExpertPrompting (bold part), and (c) schema for constraining output. We allowed the specification of a list of either strings or dictionaries because even if a list of strings is specified, a list of dictionaries may be returned, leading to deserialization failure.
  • Figure 4: Evaluation of consistency. The horizontal axis represents the five types of input, the vertical axis represents the distribution of the number of lines of the output manifest, and the color difference represents the variation of prompts.
  • Figure 5: Evaluation of context-groundedness. The horizontal axis represents the five types of input, the vertical axis represents the success rate of the evaluation prompts, and the color difference represents the variation of prompts.
  • ...and 1 more figure
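
Two of the captions above describe mechanics that a short sketch can make concrete: the Figure 3(c) schema accepts a list of either strings or dictionaries so that deserialization does not fail, and Figure 4 measures consistency as the distribution of output line counts. The Python below is a minimal illustration under stated assumptions, not the authors' implementation. The function names `normalize_items` and `line_count_summary` are hypothetical, and re-serializing dictionary entries to JSON strings is one possible way to handle the mixed list, chosen here only for illustration.

```python
import json
import statistics
from typing import Any


def normalize_items(raw: Any) -> list[str]:
    """Coerce a list of strings and/or dicts into a list of strings.

    Hypothetical helper: it tolerates both shapes that the Figure 3(c)
    caption says a model may return, instead of failing on the second.
    """
    if not isinstance(raw, list):
        raise ValueError("expected a JSON list")
    items: list[str] = []
    for entry in raw:
        if isinstance(entry, str):
            items.append(entry)
        elif isinstance(entry, dict):
            # Assumption: a dict entry is kept by re-serializing it, so
            # downstream code only ever sees strings.
            items.append(json.dumps(entry, sort_keys=True))
        else:
            raise ValueError(f"unsupported entry type: {type(entry).__name__}")
    return items


def line_count_summary(manifests: list[str]) -> dict[str, float]:
    """Summarize the spread of line counts over repeated generations.

    A narrow spread suggests consistent output for the same input;
    this mirrors the quantity plotted on the vertical axis of Figure 4.
    """
    if not manifests:
        raise ValueError("need at least one manifest")
    counts = [len(text.splitlines()) for text in manifests]
    return {
        "min": min(counts),
        "max": max(counts),
        "median": statistics.median(counts),
        "stdev": statistics.pstdev(counts),  # population standard deviation
    }
```

In use, each of the sampled generations for one input and one prompt variant would be passed through `normalize_items`, and the resulting manifest texts through `line_count_summary`, giving one summary per input and prompt pair.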