Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation

Tianyi Zhang, Shidong Pan, Zejun Zhang, Zhenchang Xing, Xiaoyu Sun

TL;DR

The paper tackles the gap in IaC generation where syntactic correctness does not guarantee deployability. It introduces the DPIaC-Eval benchmark to rigorously evaluate deployability and presents IaCGen, a deployment-driven, iterative framework that couples three-stage validation (format, syntax, live deployment) with iterative feedback, including human-in-the-loop guidance. Results show dramatic gains: initial deployability performance improves from around 25% to over 90% passItr@25 across evaluated models, with Claude variants excelling in later iterations; the approach also demonstrates generalizability to Terraform and highlights persistent trust and security gaps. The work underscores the need for deployability-focused evaluation and security-conscious generation, offering a practical DevOps-inspired paradigm for future AI-assisted infrastructure provisioning.

Abstract

Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions. However, current evaluation focuses on syntactic correctness while ignoring deployability, the critical measure of the utility of IaC configuration files. Six state-of-the-art LLMs performed poorly on deployability, achieving only a 20.8% to 30.2% deployment success rate on the first attempt. In this paper, we construct DPIaC-Eval, the first deployability-centric IaC template benchmark, consisting of 153 real-world scenarios across 58 unique services. We also propose an LLM-based deployability-centric framework, dubbed IaCGen, that uses an iterative feedback mechanism encompassing format verification, syntax checking, and live deployment stages, thereby closely mirroring real DevOps workflows. Results show that IaCGen makes 54.6% to 91.6% of the generated IaC templates from all evaluated models deployable within the first 10 iterations. Additionally, human-in-the-loop feedback that provides direct guidance on deployability errors can further boost performance to over 90% passItr@25 on all evaluated LLMs. Furthermore, we explore the trustworthiness of the generated IaC templates in terms of user intent alignment and security compliance. The poor performance (25.2% user requirement coverage and 8.4% security compliance rate) indicates a critical need for continued research in this domain.
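
The iterative feedback mechanism described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `refine_until_deployable`, `pass_itr_at_k`, and `Outcome` are hypothetical, the three validators are toy stand-ins for format verification, syntax checking, and live deployment, and the reading of passItr@k as "the fraction of scenarios that become deployable within k iterations" is an assumption based on how the abstract uses the term.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A validator takes a template and returns (ok, feedback message).
Validator = Callable[[str], Tuple[bool, str]]


@dataclass
class Outcome:
    passed_at: Optional[int]  # 1-based iteration of first success, None if never
    history: List[str]        # feedback messages fed back to the generator


def refine_until_deployable(
    generate: Callable[[str], str],
    stages: List[Tuple[str, Validator]],
    max_iters: int = 10,
) -> Outcome:
    """Generate, validate stage by stage, and feed the first failure back."""
    feedback = ""
    history: List[str] = []
    for iteration in range(1, max_iters + 1):
        template = generate(feedback)
        for name, check in stages:
            ok, message = check(template)
            if not ok:
                # Stop at the first failing stage and retry with its feedback.
                feedback = f"[{name}] {message}"
                history.append(feedback)
                break
        else:
            # No stage failed: the template counts as deployable.
            return Outcome(iteration, history)
    return Outcome(None, history)


def pass_itr_at_k(outcomes: List[Outcome], k: int) -> float:
    """Assumed metric: share of scenarios that passed within k iterations."""
    if not outcomes:
        return 0.0
    hits = sum(1 for o in outcomes if o.passed_at is not None and o.passed_at <= k)
    return hits / len(outcomes)


def make_toy_generator() -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: returns a better draft on each call."""
    drafts = ["not yaml", "Resources: {}", "Resources: {Topic: ok}"]
    calls = {"n": 0}

    def generate(feedback: str) -> str:
        draft = drafts[min(calls["n"], len(drafts) - 1)]
        calls["n"] += 1
        return draft

    return generate


# Toy validators; real ones would parse the file, lint it, and deploy it.
TOY_STAGES: List[Tuple[str, Validator]] = [
    ("format", lambda t: (t.startswith("Resources:"), "missing Resources section")),
    ("syntax", lambda t: (t != "Resources: {}", "no resources declared")),
    ("deployment", lambda t: ("Topic" in t, "stack creation failed")),
]

outcome = refine_until_deployable(make_toy_generator(), TOY_STAGES, max_iters=10)
print(outcome.passed_at, outcome.history)
print(pass_itr_at_k([outcome], k=2), pass_itr_at_k([outcome], k=3))
```

Ordering the stages from cheapest to most expensive means a template reaches the live deployment check only after it has cleared the format and syntax checks, which is the rationale the sketch assumes for the three-stage design.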

Paper Structure

This paper contains 27 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: A sample CloudFormation template creating an SNS email subscription with notification recipient email address defined with parameter.
  • Figure 2: Workflow of benchmark DPIaC-Eval construction.
  • Figure 2: Model performance with iterative feedback. A column with an upward arrow (↑) indicates its score is higher than the average. pItr stands for passItr.
  • Figure 3: Benchmark characteristics. (a) Distribution of AWS services, presenting the top 20 out of 58 most frequent services. The full list is available in the replication package. (b) Distribution of IaC template difficulty levels, showing balanced coverage across five levels.
  • Figure 4: Pass@1 scores across difficulty levels.
  • ...and 5 more figures