Nuances are the Key: Unlocking ChatGPT to Find Failure-Inducing Tests with Differential Prompting

Tsz-On Li; Wenxi Zong; Yibo Wang; Haoye Tian; Ying Wang; Shing-Chi Cheung; Jeff Kramer

TL;DR

This paper tackles the challenge of automatically finding failure-inducing test cases using ChatGPT, revealing that direct prompting yields limited success due to the model's insensitivity to subtle code nuances. It introduces Differential Prompting, a three-stage workflow comprising program intention inference, generation of reference versions, and differential testing to reveal failures. The approach dramatically improves success rates on QuixBugs (up to 75%) and Codeforces problems (up to 41%), aided by high intention inference accuracy (91%) and substantial good-reference-version generation (74.6%). The work shows that aligning LLM capabilities with differential testing can effectively locate failures, with potential for education and scaling to larger software, and provides artifacts for reproducibility. Overall, Differential Prompting represents a significant step toward leveraging LLMs for automatic fault detection and program analysis.

Abstract

Automatically detecting software failures is an important task and a longstanding challenge. It requires finding failure-inducing test cases whose test input can trigger the software's fault, and constructing an automated oracle to detect the software's incorrect behaviors. Recent advances in large language models (LLMs) motivate us to study how far this challenge can be addressed by ChatGPT, a state-of-the-art LLM. Unfortunately, our study shows that ChatGPT has a low probability (28.8%) of finding correct failure-inducing test cases for buggy programs. A possible reason is that finding failure-inducing test cases requires analyzing the subtle code differences between a buggy program and its correct version. When these two versions have similar syntax, ChatGPT is weak at recognizing subtle code differences. Our insight is that ChatGPT's performance can be substantially enhanced when ChatGPT is guided to focus on the subtle code difference. We observe that ChatGPT is effective at inferring the intended behavior of a buggy program. The intended behavior can be leveraged to synthesize programs, making the subtle code difference between a buggy program and its correct version (i.e., the synthesized program) explicit. Driven by this observation, we propose a novel approach that synergistically combines ChatGPT and differential testing to find failure-inducing test cases. We evaluate our approach on QuixBugs (a benchmark of buggy programs) and compare it with state-of-the-art baselines, including direct use of ChatGPT and Pynguin. The experimental results show that our approach has a much higher probability (77.8%) of finding correct failure-inducing test cases, 2.7X that of the best baseline.
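
The differential-testing idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the buggy program, the two reference versions, the candidate inputs, and every function name below are hypothetical, and the reference versions are written by hand here, whereas the approach would have ChatGPT synthesize them from the inferred intention. The sketch also assumes that an expected output is trusted only when all reference versions agree on it.

```python
from typing import Callable, List, Sequence, Tuple

Program = Callable[[List[int]], int]


def program_under_test(xs: List[int]) -> int:
    """Hypothetical buggy program; intended to return the largest element."""
    best = 0  # bug: should start from xs[0], so all-negative inputs fail
    for x in xs:
        if x > best:
            best = x
    return best


# Hypothetical reference versions. In the described approach these would be
# synthesized from the inferred intention; here they are hand-written stand-ins.
def reference_a(xs: List[int]) -> int:
    return max(xs)


def reference_b(xs: List[int]) -> int:
    return sorted(xs)[-1]


def find_failure_inducing_tests(
    put: Program,
    references: Sequence[Program],
    inputs: Sequence[List[int]],
) -> List[Tuple[List[int], int, int]]:
    """Return (input, expected, actual) for inputs where the program under
    test disagrees with the unanimous output of the reference versions."""
    failures = []
    for test_input in inputs:
        outputs = [ref(list(test_input)) for ref in references]
        if any(out != outputs[0] for out in outputs):
            continue  # references disagree: no trustworthy expected output
        expected = outputs[0]
        actual = put(list(test_input))
        if actual != expected:
            failures.append((list(test_input), expected, actual))
    return failures


candidate_inputs = [[1, 2, 3], [5], [-4, -2, -9], [0, -1]]
failures = find_failure_inducing_tests(
    program_under_test, [reference_a, reference_b], candidate_inputs
)
for test_input, expected, actual in failures:
    print(f"input={test_input} expected={expected} actual={actual}")
```

Inputs on which the reference versions disagree are skipped rather than reported, because without agreement there is no expected output to compare against. This is a design choice of the sketch, made to avoid flagging a failure that stems from a faulty reference version.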

Paper Structure (34 sections, 7 figures, 3 tables)

Introduction
Preliminaries
Methodology
Program Generator
Overview of Program Generator
Illustration of Program Generator's workflow
Test Case Generator
Step 1: Generating test input
Step 2: Inferring an expected output
Step 3: Differential testing
Evaluation
RQ1: Finding FTs for QuixBugs
Experiment setup
Results and findings
RQ2: Inferring Program Intention
...and 19 more sections

Figures (7)

Figure 1: An illustrative example for ChatGPT's weakness.
Figure 2: Workflow of Differential Prompting
Figure 3: Effectiveness of Differential Prompting and the baselines in finding failure-inducing test cases for buggy programs of QuixBugs. The vertical axis represents the number of test cases found by Differential Prompting or a baseline for a program subject in ten executions. The cross marks in the FT-IA column indicate the average number of FT-IA found by the three techniques.
Figure 4: Effectiveness of Differential Prompting and the baselines in finding failure-inducing test cases for correct programs of QuixBugs. The vertical axis represents the number of test cases found by Differential Prompting or a baseline for a program subject in ten executions.
Figure 5: ChatGPT's effectiveness in inferring program intention.
...and 2 more figures
