GenSIaC: Toward Security-Aware Infrastructure-as-Code Generation with Large Language Models

Yikun Li, Matteo Grella, Daniel Nahmias, Gal Engelberg, Dan Klein, Giancarlo Guizzardi, Thijs van Ede, Andrea Continella

TL;DR

This work tackles the security risks in Infrastructure-as-Code (IaC) generated by Large Language Models (LLMs) by evaluating base models, constructing a security-focused instruction-tuning dataset, and fine-tuning models to generate security-aware IaC while recognizing security weaknesses. The authors introduce GenSIaC, which leverages LoRA-based, parameter-efficient fine-tuning and a two-pronged dataset (code generation and code inspection) derived from large IaC corpora to teach models about nine common IaC security weaknesses. Empirical results show substantial improvements in $F_1$ scores for weakness detection (e.g., from around $0.3$ to $0.86$ in some settings) and near-perfect syntactic correctness, with strong cross-language generalization while maintaining competitiveness with GPT-4 on several metrics. GenSIaC thus provides a scalable, resource-efficient path to secure IaC generation and inspection, with clear directions for extending weakness coverage and exploring alternative tuning methods.

Abstract

In recent years, Infrastructure as Code (IaC) has emerged as a critical approach for managing and provisioning IT infrastructure through code and automation. IaC enables organizations to create scalable and consistent environments, effectively managing servers and development settings. However, the growing complexity of cloud infrastructures has led to an increased risk of misconfigurations and security vulnerabilities in IaC scripts. To address this problem, this paper investigates the potential of Large Language Models (LLMs) in generating security-aware IaC code, avoiding misconfigurations introduced by developers and administrators. While LLMs have made significant progress in natural language processing and code generation, their ability to generate secure IaC scripts remains unclear. This paper addresses two major problems: 1) the lack of understanding of security weaknesses in IaC scripts generated by LLMs, and 2) the absence of techniques for enhancing security in generating IaC code with LLMs. To assess the extent to which LLMs contain security knowledge, we first conduct a comprehensive evaluation of base LLMs in recognizing major IaC security weaknesses during the generation and inspection of IaC code. Then, we propose GenSIaC, an instruction fine-tuning dataset designed to improve LLMs' ability to recognize potential security weaknesses. Leveraging GenSIaC, we fine-tune LLMs and instruct models to generate security-aware IaC code. Our evaluation demonstrates that our models achieve substantially improved performance in recognizing and preventing IaC security misconfigurations, e.g., boosting the F1-score from 0.303 to 0.858. Additionally, we perform ablation studies and explore GenSIaC's generalizability to other LLMs and its cross-language capabilities.
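
For readers unfamiliar with the metric quoted above, the F1-score is the harmonic mean of precision and recall. The following is a minimal sketch of the standard definition. The function name `f1_score` and the counts used are hypothetical and chosen only for illustration; they are not taken from the paper's evaluation.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true-positive, false-positive, and false-negative counts."""
    if tp == 0:
        # No correct detections: precision and recall are both 0 (or undefined),
        # so F1 is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)  # share of flagged weaknesses that are real
    recall = tp / (tp + fn)     # share of real weaknesses that were flagged
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not results from the paper).
print(round(f1_score(tp=6, fp=2, fn=2), 3))  # prints 0.75
```

Because the harmonic mean is pulled toward the lower of the two values, a model cannot reach a high F1 by flagging every script as weak (high recall, low precision) or by flagging almost nothing (high precision, low recall).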

Paper Structure

This paper contains 69 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Simplified example of Hard-Coded Secret in an IaC script.
  • Figure 2: Example of Code Generation Data.
  • Figure 3: Example of Code Inspection Data.
  • Figure 4: Overall F1-score for nine security weaknesses detection.
  • Figure 5: Comparison between base LLMs and GenSIaC in generating syntactically and functionally correct code.
  • ...and 3 more figures