SALLM: Security Assessment of Generated Code

Mohammed Latif Siddiq, Joanna C. S. Santos, Sajith Devareddy, Anna Muller

TL;DR

SALLM tackles the gap between functional code generation and security by introducing a security-centric benchmarking framework for LLMs. It combines a manually curated, Python-focused prompt dataset, a rule-based code repair module, and two assessment approaches (static analysis and dynamic testing) with two new metrics, secure@k and vulnerable@k. The framework enables automated, runnable evaluation across multiple models and temperatures, revealing that state-of-the-art models like GPT-4 excel at functional correctness but lag in security, while certain open models balance correctness with security differently. This work provides practical tooling for automated secure-code evaluation and informs model selection and future research toward secure AI-assisted software engineering.

Abstract

With the growing popularity of Large Language Models (LLMs) in software engineers' daily practices, it is important to ensure that the code generated by these tools is not only functionally correct but also free of vulnerabilities. Although LLMs can help developers be more productive, prior empirical studies have shown that LLMs can generate insecure code. Two factors contribute to insecure code generation. First, existing datasets used to evaluate LLMs do not adequately represent genuine, security-sensitive software engineering tasks. Instead, they are often based on competitive programming challenges or classroom-type coding tasks. In real-world applications, the generated code is integrated into larger codebases, introducing potential security risks. Second, existing evaluation metrics focus primarily on the functional correctness of the generated code and ignore security considerations. Therefore, in this paper, we describe SALLM, a framework to systematically benchmark LLMs' abilities to generate secure code. This framework has three major components: a novel dataset of security-centric Python prompts, configurable assessment techniques to evaluate the generated code, and novel metrics to evaluate the models' performance from the perspective of secure code generation.
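
The TL;DR names the two metrics, secure@k and vulnerable@k, but this excerpt does not reproduce their definitions. The sketch below shows one plausible way to compute such metrics for a single prompt. It assumes they are defined by analogy with the familiar pass@k estimator: vulnerable@k as the chance that at least one of k sampled completions is vulnerable, and secure@k as the chance that all k are vulnerability-free. Those definitions, and the function and parameter names, are assumptions for illustration and not the paper's own formulation.

```python
from math import comb


def _check(n: int, flagged: int, k: int) -> None:
    """Validate sample counts shared by both estimators."""
    if not (0 <= flagged <= n and 1 <= k <= n):
        raise ValueError("require 0 <= flagged <= n and 1 <= k <= n")


def vulnerable_at_k(n: int, v: int, k: int) -> float:
    """Assumed definition: probability that a random size-k subset of the
    n generated samples contains at least one of the v vulnerable ones."""
    _check(n, v, k)
    # comb(n - v, k) counts the subsets with no vulnerable sample.
    return 1.0 - comb(n - v, k) / comb(n, k)


def secure_at_k(n: int, s: int, k: int) -> float:
    """Assumed definition: probability that every sample in a random
    size-k subset is one of the s vulnerability-free samples."""
    _check(n, s, k)
    return comb(s, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples for one prompt, 3 flagged vulnerable.
    n, v = 10, 3
    for k in (1, 5):
        print(k, vulnerable_at_k(n, v, k), secure_at_k(n, n - v, k))
```

Under these assumed definitions, and when every sample is classified as either vulnerable or secure, the two values for a given prompt sum to 1. A benchmark-level score would average the per-prompt values over all prompts.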

Paper Structure (37 sections, 1 equation, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Example of generated code containing a SQL injection vulnerability.
  • Figure 2: Framework overview
  • Figure 3: Compilation rates before and after applying Sallm's repair rules