Automating the Correctness Assessment of AI-generated Code for Security Contexts

Domenico Cotroneo; Alessio Foggia; Cristina Improta; Pietro Liguori; Roberto Natella

TL;DR

This work tackles the problem of automatically validating the correctness of AI-generated security-oriented assembly code, where traditional text-similarity metrics fall short. It introduces ACCA, a pipeline that first checks for exact matches, then verifies syntactic compilability, and finally assesses semantic equivalence via symbolic execution and SMT-based state comparison. Empirical results show ACCA achieves a strong correlation with human evaluation (average $r=0.84$) and outperforms baseline metrics and ChatGPT-based assessments across multiple code generators, while remaining fully automated and comparatively fast (averages around $0.43$ s per snippet). The approach offers a scalable, language- and architecture-agnostic way to judge semantic correctness of AI-generated code in security contexts, with potential extensions to static analysis, binary matching, and higher-level languages.
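The three-stage flow described above can be illustrated with a toy sketch. This is not the ACCA implementation: the supported instruction set, the helper names (`parse`, `run`, `assess`), and the whitespace normalization are assumptions made for illustration. The last stage compares final register values over a handful of concrete initial states, which is only a crude stand-in for symbolic execution with SMT-based state comparison. The toy model also ignores CPU flags and memory.

```python
from typing import Dict, List, Tuple

REGS = ("eax", "ebx", "ecx", "edx")
MASK = 0xFFFFFFFF  # keep values within 32 bits


def parse(code: str) -> List[Tuple[str, str, str]]:
    """Toy 'assembler' check: accept only 'op dst, src' on a tiny instruction subset."""
    program = []
    for line in code.strip().splitlines():
        line = line.strip()
        if not line:
            continue
        op, _, operands = line.partition(" ")
        parts = [p.strip() for p in operands.split(",")]
        if op not in ("mov", "add", "sub", "xor") or len(parts) != 2:
            raise ValueError(f"unsupported instruction: {line!r}")
        dst, src = parts
        if dst not in REGS:
            raise ValueError(f"unsupported destination: {dst!r}")
        if src not in REGS:
            int(src, 0)  # raises ValueError if src is not an immediate
        program.append((op, dst, src))
    return program


def run(program: List[Tuple[str, str, str]], state: Dict[str, int]) -> Dict[str, int]:
    """Execute the toy program on a copy of the given register state."""
    state = dict(state)
    for op, dst, src in program:
        value = state[src] if src in REGS else int(src, 0)
        if op == "mov":
            state[dst] = value
        elif op == "add":
            state[dst] = state[dst] + value
        elif op == "sub":
            state[dst] = state[dst] - value
        else:  # xor
            state[dst] = state[dst] ^ value
        state[dst] &= MASK
    return state


def assess(predicted: str, reference: str, initial_states: List[Dict[str, int]]) -> int:
    """Return 1 if the predicted snippet is judged correct, else 0."""
    # Stage 1: textual match (whitespace-insensitive here, an assumption).
    if " ".join(predicted.split()) == " ".join(reference.split()):
        return 1
    # Stage 2: syntactic check; a snippet that does not "assemble" is incorrect.
    try:
        pred_prog = parse(predicted)
    except ValueError:
        return 0
    # The reference is assumed to be valid; an invalid one raises ValueError.
    ref_prog = parse(reference)
    # Stage 3: behavioural comparison of the final states (concrete stand-in).
    return int(all(run(pred_prog, s) == run(ref_prog, s) for s in initial_states))


states = [{r: v for r in REGS} for v in (0, 1, 0xFFFFFFFF)]
print(assess("xor eax, eax", "mov eax, 0", states))  # textually different, same final state
print(assess("mov eax, 1", "mov eax, 0", states))    # assembles, but behaves differently
```

The first call shows why text similarity alone can be misleading: the two snippets share few tokens, yet leave the registers in the same state under this toy model.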

Abstract

Evaluating the correctness of code generated by AI is a challenging open problem. In this paper, we propose a fully automated method, named ACCA, to evaluate the correctness of AI-generated code for security purposes. The method uses symbolic execution to assess whether the AI-generated code behaves like a reference implementation. We use ACCA to assess four state-of-the-art models trained to generate security-oriented assembly code and compare the results of the evaluation with different baseline solutions, including output similarity metrics, which are widely used in the field, and ChatGPT, the AI-powered language model developed by OpenAI. Our experiments show that our method outperforms the baseline solutions and assesses the correctness of the AI-generated code similarly to the human-based evaluation, which is considered the ground truth for the assessment in the field. Moreover, ACCA has a very strong correlation with the human evaluation (Pearson's correlation coefficient r=0.84 on average). Finally, since it is a fully automated solution that does not require any human intervention, the proposed method assesses each code snippet in ~0.17s on average, which, in our experience, is far lower than the average time human analysts need to inspect the code manually.
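As a reminder of how the agreement with human evaluation is quantified, the following minimal sketch computes Pearson's correlation coefficient between automated scores and human labels. The function name `pearson_r` and the score lists are hypothetical and purely illustrative; they are not data from the paper.

```python
from math import sqrt
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's r: covariance of xs and ys divided by the product of their spreads."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-snippet scores (1 = correct, 0 = incorrect), for illustration only.
automated = [1, 0, 1, 1, 0, 1]
human = [1, 0, 1, 0, 0, 1]
print(round(pearson_r(automated, human), 2))
```

A value close to 1 means the automated scores rise and fall together with the human judgments; a value near 0 means no linear relationship.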

Paper Structure (20 sections, 4 figures, 8 tables)

This paper contains 20 sections, 4 figures, 8 tables.

Introduction
Motivating Example
Proposed Method
String Comparison
Assembling Process
Symbolic Execution
Implementation Details
Experimental Setup
AI-code Generation
Dataset
Baseline Assessment Solutions
Human Evaluation
Experimental Results
Quantitative Analysis
Qualitative Analysis
...and 5 more sections

Figures (4)

Figure 1: Detailed flowchart of ACCA.
Figure 2: Example of a code snippet represented as a sequence of basic blocks.
Figure 3: Example of prompt used for code generation with ChatGPT.
Figure 4: Example of prompt used for code correctness assessment with respect to the English description using ChatGPT.
