LLM-Powered Test Case Generation for Detecting Bugs in Plausible Programs

Kaibo Liu, Zhenpeng Chen, Yiyang Liu, Jie M. Zhang, Mark Harman, Yudong Han, Yun Ma, Yihong Dong, Ge Li, Gang Huang

TL;DR

TrickCatcher introduces an LLM-powered framework for detecting tricky bugs in plausible programs by (i) generating PUT-guided program variants, (ii) creating generator-based test inputs, and (iii) applying diversity-driven differential testing. Evaluated on TrickyBugs and EvalPlus, it achieves substantial gains in recall, precision, and F1 over baselines, and ablation confirms the value of each component. The work highlights that buggy variants can aid bug detection and demonstrates reasonable generalization across models, offering a practical path to uncover hidden defects in real-world plausible programs.

Abstract

Detecting tricky bugs in plausible programs, those that pass existing test suites yet still contain bugs, remains a significant challenge in software testing. To address this problem, we propose TrickCatcher, an LLM-powered approach to generating test cases for uncovering bugs in plausible programs. TrickCatcher operates in three stages: First, it uses an LLM to generate program variants based on the program under test (PUT) and its specification. Second, it employs an LLM to construct an input generator from the specification for producing test inputs. Finally, these inputs are executed on both the PUT and its program variants to detect inconsistencies in their outputs. We evaluate TrickCatcher on two datasets, TrickyBugs and EvalPlus, which include 366 human-written and 151 AI-generated plausible programs with tricky bugs. TrickCatcher achieves recall, precision, and F1 scores that are 1.80x, 2.65x, and 1.66x those of the state-of-the-art baselines, respectively. Code and data used are available at https://github.com/RinCloud/TrickCatcher.
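
The third stage described above, running the same inputs on the PUT and its variants and looking for disagreements, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy `put`, the three hand-written variants (standing in for LLM-generated ones), the random `input_generator` (standing in for an LLM-constructed generator), and the majority vote used to pick a reference output. It is not TrickCatcher's actual implementation, which is available in the repository linked above.

```python
import random
from collections import Counter
from typing import Callable, List, Tuple

Program = Callable[[List[int]], int]


def put(xs: List[int]) -> int:
    """Hypothetical plausible program: returns the maximum of a list.
    It passes tests with positive numbers but fails on all-negative input."""
    best = 0  # bug: should start from xs[0]
    for x in xs:
        if x > best:
            best = x
    return best


# Hypothetical program variants (hand-written stand-ins for LLM output).
def variant_a(xs: List[int]) -> int:
    return max(xs)


def variant_b(xs: List[int]) -> int:
    return sorted(xs)[-1]


def variant_c(xs: List[int]) -> int:
    best = xs[0]
    for x in xs[1:]:
        if x > best:
            best = x
    return best


def input_generator(rng: random.Random) -> List[int]:
    """Hypothetical input generator: a non-empty list of small integers."""
    return [rng.randint(-10, 10) for _ in range(rng.randint(1, 5))]


def differential_test(
    program: Program, variants: List[Program], inputs: List[List[int]]
) -> List[Tuple[List[int], int, int]]:
    """Return (input, PUT output, majority variant output) for every input
    on which the PUT disagrees with the most common variant output.
    The majority vote is an assumption of this sketch."""
    suspicious = []
    for inp in inputs:
        put_out = program(list(inp))
        variant_outs = [v(list(inp)) for v in variants]
        majority_out, _ = Counter(variant_outs).most_common(1)[0]
        if put_out != majority_out:
            suspicious.append((inp, put_out, majority_out))
    return suspicious


if __name__ == "__main__":
    rng = random.Random(0)
    inputs = [input_generator(rng) for _ in range(200)]
    for inp, got, expected in differential_test(
        put, [variant_a, variant_b, variant_c], inputs
    )[:5]:
        print(f"input={inp} PUT={got} variants={expected}")
```

Each reported triple is a candidate bug-revealing test case: the input plus the output the variants agree on. Because the variants may themselves be wrong, such a report is a signal to inspect rather than proof of a bug.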

Paper Structure

This paper contains 25 sections, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: A motivating example.
  • Figure 2: Overview of TrickCatcher.
  • Figure 3: Prompt for generating program variants.
  • Figure 4: Prompt for generating test input generator.
  • Figure 5: (RQ2) False positives generated by each approach for correct programs. Lower values indicate better performance. TrickCatcher generates significantly fewer false positives compared to the other methods.
  • ...and 3 more figures