Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy

Hamed Taherkhani, Jiho Shin, Muhammad Ammar Tahir, Md Rakib Hossain Misu, Vineet Sunil Gattani, Hadi Hemmati

TL;DR

This paper introduces VALTEST, a framework that uses semantic entropy to automatically validate test cases generated by LLMs, improving the correctness of the LLM-generated tests used in software testing and code generation.

Abstract

Modern Large Language Model (LLM)-based programming agents often rely on test execution feedback to refine their generated code. These tests are themselves synthesized by LLMs. However, LLMs may produce invalid or hallucinated test cases, which can mislead the feedback loop and degrade an agent's ability to refine and improve code. This paper introduces VALTEST, a novel framework that leverages semantic entropy to automatically validate test cases generated by LLMs. By analyzing the semantic structure of test cases and computing entropy-based uncertainty measures, VALTEST trains a machine learning model to classify test cases as valid or invalid and filters out the invalid ones. Experiments on multiple benchmark datasets and various LLMs show that VALTEST not only boosts test validity by up to 29% but also improves code generation performance, as evidenced by significant increases in pass@1 scores. The experiments also show that semantic entropy is a reliable indicator for distinguishing valid from invalid test cases, making it a robust basis for improving the correctness of LLM-generated test cases used in software testing and code generation.
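
To make the idea of an entropy-based uncertainty measure concrete, the following is a minimal sketch, not VALTEST itself. It computes plain Shannon entropy over each token's candidate probabilities and applies a fixed threshold, whereas the abstract describes semantic entropy features fed to a trained classifier. The field names (`source`, `output_token_probs`), the threshold value, and the example probabilities are all hypothetical, chosen only for illustration.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's candidate probabilities.

    `probs` holds the probabilities of the top candidates for a single
    token position; they are renormalized so that they sum to 1.
    """
    total = sum(probs)
    normalized = [p / total for p in probs if p > 0]
    return -sum(p * math.log(p) for p in normalized)


def mean_entropy(token_distributions):
    """Average token entropy over one part of a test case."""
    if not token_distributions:
        return 0.0
    entropies = [token_entropy(d) for d in token_distributions]
    return sum(entropies) / len(entropies)


def filter_tests(tests, threshold=0.5):
    """Keep tests whose expected-output tokens look confident.

    Assumption: a fixed threshold stands in for the trained classifier
    that the abstract describes.
    """
    kept = []
    for test in tests:
        if mean_entropy(test["output_token_probs"]) <= threshold:
            kept.append(test["source"])
    return kept


# Hypothetical test cases with made-up candidate probabilities.
tests = [
    {"source": "assert add(2, 3) == 5",
     "output_token_probs": [[0.99, 0.01]]},
    {"source": "assert add(2, 3) == 6",
     "output_token_probs": [[0.40, 0.35, 0.25]]},
]
kept = filter_tests(tests)
```

In this sketch the first test is kept because the model is nearly certain about the expected output token, while the second is dropped because probability mass is spread across several candidates. A learned classifier over several such features would replace the single hard-coded threshold.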

Paper Structure

This paper contains 44 sections, 18 equations, 4 figures, 13 tables, 2 algorithms.

Figures (4)

  • Figure 1: An example of test case generation using GPT4o. The check mark indicates a valid test and the cross mark indicates an invalid test. The entropy of function input and expected output parts are displayed on the right.
  • Figure 2: Overall approach of VALTEST
  • Figure 3: Impact of test suite validity rate on code generation performance in Reflexion for BigCodeBench-hard and BigCodeBench-full sets. Pass@1 is the average pass@1 for executions over 5 random test subsets of size 500 with a specific VR.
  • Figure 4: Validity Rate (VR) comparison across 16 experiments. Existing baselines are Naive Entropy, Basic Entropy, FirstN, Semantic Probability, Semantic Entropy, and CoT.