Deployability-Centric Infrastructure-as-Code Generation: Fail, Learn, Refine, and Succeed through LLM-Empowered DevOps Simulation
Tianyi Zhang, Shidong Pan, Zejun Zhang, Zhenchang Xing, Xiaoyu Sun
TL;DR
The paper tackles a gap in IaC generation: syntactic correctness does not guarantee deployability. It introduces the DPIaC-Eval benchmark to evaluate deployability and presents IaCGen, a deployment-driven framework that couples three-stage validation (format, syntax, live deployment) with iterative feedback, including human-in-the-loop guidance. First-attempt deployment success of 20.8% to 30.2% rises to over 90% passItr@25 across all evaluated models once human-in-the-loop feedback is added, with Claude variants excelling in later iterations. The approach also demonstrates generalizability to Terraform, while the generated templates still show persistent trust and security gaps. The work underscores the need for deployability-focused evaluation and security-conscious generation, offering a practical DevOps-inspired paradigm for future AI-assisted infrastructure provisioning.
Abstract
Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present an opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions. However, current evaluation focuses on syntactic correctness while ignoring deployability, the critical measure of the utility of IaC configuration files. Six state-of-the-art LLMs performed poorly on deployability, achieving a deployment success rate of only 20.8$\sim$30.2% on the first attempt. In this paper, we construct DPIaC-Eval, the first deployability-centric IaC template benchmark, consisting of 153 real-world scenarios across 58 unique services. We also propose an LLM-based deployability-centric framework, dubbed IaCGen, that uses an iterative feedback mechanism encompassing format verification, syntax checking, and live deployment stages, thereby closely mirroring real DevOps workflows. Results show that IaCGen makes 54.6$\sim$91.6% of the generated IaC templates from all evaluated models deployable within the first 10 iterations. Additionally, human-in-the-loop feedback that provides direct guidance on deployability errors can further boost performance to over 90% passItr@25 on all evaluated LLMs. Furthermore, we explore the trustworthiness of the generated IaC templates in terms of user intent alignment and security compliance. The poor performance (25.2% user requirement coverage and 8.4% security compliance rate) indicates a critical need for continued research in this domain.
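The iterative feedback mechanism described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The names `refine`, `check_format`, `check_syntax`, `demo_generate`, `demo_deploy`, and `pass_itr_at_k` are hypothetical, the template is a toy JSON document with a `Resources` key chosen only for illustration, and the deployment stage is an injected callable rather than a real cloud call. The reading of passItr@k as "the fraction of scenarios that become deployable within k iterations" is likewise an assumption, since the abstract does not define the metric.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional

# A stage returns None on success or an error message on failure.
Stage = Callable[[str], Optional[str]]
# A generator maps (prompt, feedback from the previous attempt) to a template.
Generator = Callable[[str, Optional[str]], str]


def check_format(template: str) -> Optional[str]:
    """Stage 1 (format): assume the template must be parseable JSON."""
    try:
        json.loads(template)
    except json.JSONDecodeError as exc:
        return f"not valid JSON: {exc}"
    return None


def check_syntax(template: str) -> Optional[str]:
    """Stage 2 (syntax): toy stand-in for a real IaC linter."""
    doc = json.loads(template)
    if not isinstance(doc, dict) or not doc.get("Resources"):
        return "template needs a non-empty 'Resources' section"
    return None


@dataclass
class Result:
    deployed: bool
    iterations: int
    errors: List[str]


def refine(prompt: str, generate: Generator, deploy: Stage,
           max_iters: int = 10) -> Result:
    """Generate, validate in three stages, and feed the first error back."""
    stages = [("format", check_format), ("syntax", check_syntax),
              ("deployment", deploy)]
    feedback: Optional[str] = None
    errors: List[str] = []
    for iteration in range(1, max_iters + 1):
        template = generate(prompt, feedback)
        feedback = None
        for name, stage in stages:
            error = stage(template)
            if error is not None:
                feedback = f"[{name}] {error}"  # shown to the model next round
                errors.append(feedback)
                break  # later stages are skipped once one fails
        if feedback is None:
            return Result(True, iteration, errors)
    return Result(False, max_iters, errors)


def pass_itr_at_k(results: List[Result], k: int) -> float:
    """Assumed metric: share of scenarios deployable within k iterations."""
    if not results:
        return 0.0
    return sum(r.deployed and r.iterations <= k for r in results) / len(results)


def demo_generate(prompt: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM that reacts to the last error."""
    if feedback is None:
        return "Resources: bucket"  # first attempt is not JSON at all
    if feedback.startswith("[format]"):
        return '{"Resources": {}}'  # valid JSON, but no resources declared
    return '{"Resources": {"Bucket": {"Type": "AWS::S3::Bucket"}}}'


def demo_deploy(template: str) -> Optional[str]:
    """Hypothetical deployment stage that always succeeds."""
    return None


result = refine("create a storage bucket", demo_generate, demo_deploy)
print(result.deployed, result.iterations, len(result.errors))
```

Failing fast at the cheapest stage keeps live deployment, the slowest and most expensive check, for templates that already pass the format and syntax checks. Human-in-the-loop guidance could be modeled in this sketch by replacing the automatic `feedback` string with a human-written hint when the loop stalls.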
