CodeSift: An LLM-Based Reference-Less Framework for Automatic Code Validation

Pooja Aggarwal, Oishik Chatterjee, Ting Dai, Prateeti Mohapatra, Brent Paulovicks, Brad Blancett, Arthur De Magalhaes

TL;DR

CodeSift presents a reference-free, first-line framework for automated code validation that uses LLMs to assess functional correctness without executing code or requiring ground-truth references. It combines syntax checking with a three-phase semantic evaluation (code-to-functionality, similarity, difference) and ensemble synthesis to label code as correct or incorrect. Evaluated on Python datasets (HumanEval, MBPP+) and a Bash dataset of 100 tasks, CodeSift consistently outperforms ICE-Score and Reference Grading and aligns well with human judgments in SME studies. The work demonstrates the practical potential of LLM-based text-to-text validation to reduce validation effort and improve safety in automated code generation pipelines.

Abstract

The advent of large language models (LLMs) has greatly facilitated code generation, but ensuring the functional correctness of generated code remains a challenge. Traditional validation methods are often time-consuming, error-prone, and impractical for large volumes of code. We introduce CodeSift, a novel framework that leverages LLMs as the first-line filter of code validation without the need for execution, reference code, or human feedback, thereby reducing the validation effort. We assess the effectiveness of our method across three diverse datasets encompassing two programming languages. Our results indicate that CodeSift outperforms state-of-the-art code evaluation methods. Internal testing conducted with subject matter experts reveals that the output generated by CodeSift is in line with human preference, reinforcing its effectiveness as a dependable automated code validation tool.

Paper Structure (13 sections, 1 figure, 4 tables)

This paper contains 13 sections, 1 figure, 4 tables.

Figures (1)

  • Figure 1: CodeSift: Automated Script Validation System Diagram and Prompt Examples. Prompt 1 extracts code functionality, Prompt 2 assesses task-code similarity, and Prompt 3 analyzes differences.
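
The pipeline summarized in the TL;DR and in Figure 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt wording, the `llm` callable (any function that maps a prompt string to a reply string), the Yes/No answer parsing, and the ensemble rule (accept only when both the similarity and difference checks agree) are all assumptions made for illustration; the syntax pre-check uses Python's standard `ast.parse` and therefore applies to Python code only.

```python
import ast
from typing import Callable

# Hypothetical interface: any function that sends a prompt to an LLM
# and returns the model's text reply.
LLM = Callable[[str], str]


def syntax_ok(code: str) -> bool:
    """Syntax pre-check: reject code that does not parse as Python."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def says_yes(reply: str) -> bool:
    """Assumed answer format: the reply begins with 'Yes' or 'No'."""
    return reply.strip().lower().startswith("yes")


def validate(task: str, code: str, llm: LLM) -> bool:
    """Label code as correct (True) or incorrect (False) without running it."""
    if not syntax_ok(code):
        return False

    # Phase 1 (code-to-functionality): describe what the code does.
    functionality = llm(f"Describe what the following code does.\n\n{code}")

    # Phase 2 (similarity): does the described functionality match the task?
    similar = says_yes(llm(
        f"Task: {task}\nFunctionality: {functionality}\n"
        "Does the functionality fulfil the task? Answer Yes or No."
    ))

    # Phase 3 (difference): is anything required by the task missing or altered?
    differs = says_yes(llm(
        f"Task: {task}\nFunctionality: {functionality}\n"
        "Is there any difference between the task and the functionality? "
        "Answer Yes or No."
    ))

    # Ensemble (assumed rule): accept only if both checks agree the code is fine.
    return similar and not differs
```

Because the model is passed in as a plain callable, the control flow can be exercised with a stub before wiring in a real LLM client.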