CloudEval-YAML: A Practical Benchmark for Cloud Configuration Generation

Yifei Xu; Yuning Chen; Xumiao Zhang; Xianshang Lin; Pan Hu; Yunfei Ma; Songwu Lu; Wan Du; Zhuoqing Mao; Ennan Zhai; Dennis Cai

TL;DR

CloudEval-YAML is a hand-written, YAML-centric benchmark with an end-to-end evaluation platform for cloud configuration generation. It focuses on YAML to cope with the diversity of cloud-native tooling. The dataset pairs 1011 practical problems with unit tests and augmented prompts, and the platform scores model outputs with text-level, YAML-aware, and unit-test metrics. An evaluation of 12 LLMs finds that proprietary models markedly outperform open-source ones, that multi-sample generation can reduce cost, and that few-shot prompting offers limited gains. The results expose model strengths, failure modes, and cost-performance trade-offs that can guide model development and tooling for cloud-configuration tasks.
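To make the difference between a text-level and a YAML-aware metric concrete, here is a minimal Python sketch. It is not the benchmark's scoring code: the function names `text_similarity`, `flatten`, and `structure_match` are hypothetical, and the inputs to `structure_match` are assumed to be the nested dicts and lists that a YAML parser would return. The unit-test metric is not sketched, since it requires running the generated configuration in a real environment.

```python
from difflib import SequenceMatcher

_MISSING = object()  # sentinel for paths absent from the generated config


def text_similarity(generated: str, reference: str) -> float:
    """Text-level score: character-sequence similarity ratio in [0, 1]."""
    return SequenceMatcher(None, generated, reference).ratio()


def flatten(node, prefix=()):
    """Flatten parsed YAML (nested dicts/lists) into a {path: leaf} mapping."""
    if isinstance(node, dict) and node:
        items = node.items()
    elif isinstance(node, list) and node:
        items = enumerate(node)
    else:
        return {prefix: node}  # a scalar or an empty container counts as a leaf
    flat = {}
    for key, value in items:
        flat.update(flatten(value, prefix + (key,)))
    return flat


def structure_match(generated, reference) -> float:
    """Structure-aware score: fraction of reference leaves reproduced exactly."""
    gen, ref = flatten(generated), flatten(reference)
    hits = sum(1 for path, value in ref.items() if gen.get(path, _MISSING) == value)
    return hits / len(ref)


# Illustrative data only: a small Pod-like config as a parser would return it.
reference = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {"containers": [{"name": "web", "image": "nginx"}]},
}
# Same content with top-level keys in a different order.
reordered = {
    "kind": "Pod",
    "apiVersion": "v1",
    "spec": {"containers": [{"image": "nginx", "name": "web"}]},
    "metadata": {"name": "web"},
}
# One leaf differs from the reference.
wrong_image = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "web"},
    "spec": {"containers": [{"name": "web", "image": "nginx:latest"}]},
}

score_reordered = structure_match(reordered, reference)
score_wrong_image = structure_match(wrong_image, reference)
```

Key order changes the text of a file but not its meaning, so a structure-aware comparison treats `reordered` as a full match, while a text-level comparison of the serialized files would penalize it. Neither score says whether the configuration actually works when deployed, which is the gap unit tests are meant to cover.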

Abstract

Amid the thriving ecosystem of cloud computing and the proliferation of Large Language Model (LLM)-based code generation tools, there is a lack of benchmarking for code generation in cloud-native applications. In response to this need, we present CloudEval-YAML, a practical benchmark for cloud configuration generation. CloudEval-YAML tackles the diversity challenge by focusing on YAML, the de facto standard of numerous cloud-native tools. We develop the CloudEval-YAML benchmark with practicality in mind: the dataset consists of hand-written problems with unit tests targeting practical scenarios. We further enhance the dataset to meet practical needs by rephrasing questions in a concise, abbreviated, and bilingual manner. The dataset consists of 1011 problems that took more than 1200 human hours to complete. To improve practicality during evaluation, we build a scalable evaluation platform for CloudEval-YAML that achieves a 20-fold speedup over a single machine. To the best of our knowledge, the CloudEval-YAML dataset is the first hand-written dataset targeting cloud-native applications. We present an in-depth evaluation of 12 LLMs, leading to a deeper understanding of the problems and LLMs, as well as effective methods to improve task performance and reduce cost.

Paper Structure (17 sections, 9 figures, 7 tables)

This paper contains 17 sections, 9 figures, 7 tables.

Introduction
The CloudEval-YAML Dataset
Overall Structure
Practical Data Augmentation
Statistics of the CloudEval-YAML dataset
The CloudEval-YAML Benchmark Platform
YAML Answer Generation
Performance Score Calculation
Cloud-based Evaluation Framework
Running Cost of the Benchmark
Evaluations on CloudEval-YAML
Comprehensive Benchmark and Analysis
Multi-sample Generation
Few-shot Prompting
Predicting Unit Test Results
...and 2 more sections

Figures (9)

Figure 1: An example problem of the CloudEval-YAML dataset, including a problem specification in natural language with an optional sample YAML file as prompt to LLMs, a reference YAML file, and a bash unit test script, to evaluate the YAML output from the LLM.
Figure 2: The practical data augmentation framework with examples of a simplified and a translated question.
Figure 3: Workflow of the CloudEval-YAML benchmark platform.
Figure 4: Architecture of shared Docker image caching.
Figure 5: Evaluation time over all 1011 problems.
...and 4 more figures
