ChatGPT Incorrectness Detection in Software Reviews

Minaoar Hossain Tanzil; Junaed Younus Khan; Gias Uddin

TL;DR

The paper addresses the trustworthiness of ChatGPT in software engineering tasks by first surveying practitioners to reveal reliance and verification practices, then introducing CID, a black-box incorrectness detector that uses iterative, metamorphic prompting (ENQUIRER, CHALLENGER, DECIDER). CID detects inconsistencies across contextually similar but textually divergent prompts, achieving an $F1$-score of $0.74$-$0.75$ (accuracy $0.75$) in a software library selection benchmark. The approach relies on structured data collection, explanation labeling, and a 24-feature consistency-based detection model, with mutation-based challenges driving performance. The work highlights practical value for automated verification of LLM outputs in SE and outlines paths for generalizing CID to other SE tasks and broader use.

Abstract

We surveyed 135 software engineering (SE) practitioners to understand how they use generative AI-based chatbots such as ChatGPT for SE tasks. We find that they want to use ChatGPT for SE tasks such as software library selection but often worry about the truthfulness of its responses. We developed a suite of techniques and a tool called CID (ChatGPT Incorrectness Detector) to automatically test for and detect incorrectness in ChatGPT responses. CID iteratively prompts ChatGPT with contextually similar but textually divergent questions, generated using an approach based on metamorphic relationships in texts. The underlying principle in CID is that, for a given question, a response that differs from the other responses (across multiple incarnations of the question) is likely an incorrect response. In a benchmark study of library selection, we show that CID can detect incorrect responses from ChatGPT with an F1-score of 0.74 to 0.75.
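The consistency principle described above can be illustrated with a minimal Python sketch. This is not the paper's DECIDER, which the TL;DR describes as a 24-feature detection model; it is a simplified stand-in that assumes token-level Jaccard similarity as the agreement measure and an arbitrary threshold of 0.3. The function names, the threshold, and the example responses are all hypothetical and chosen only for illustration.

```python
import re

def tokens(text):
    """Lowercase word tokens of a response, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity between two token sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def mean_agreement(responses):
    """For each response, the mean Jaccard similarity to all other responses."""
    sets = [tokens(r) for r in responses]
    scores = []
    for i, s in enumerate(sets):
        others = [jaccard(s, t) for j, t in enumerate(sets) if j != i]
        scores.append(sum(others) / len(others) if others else 1.0)
    return scores

def flag_inconsistent(responses, threshold=0.3):
    """Indices of responses whose mean agreement falls below the threshold.

    The threshold is an assumed value for this sketch, not one from the paper.
    """
    return [i for i, score in enumerate(mean_agreement(responses))
            if score < threshold]

# Made-up responses to three rephrasings of the same question.
responses = [
    "Yes, the library supports streaming parsing.",
    "Yes, the library supports streaming parsing of large files.",
    "No, streaming is not available in this library.",
]
flagged = flag_inconsistent(responses)
```

Here the third response shares few tokens with the other two, so it is the only one flagged. A lexical overlap measure is a crude proxy for semantic agreement; it is used only to make the "odd one out" idea concrete.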

Paper Structure (32 sections, 3 figures, 6 tables)

This paper contains 32 sections, 3 figures, 6 tables.

Introduction
Survey to Assess Software Developer Perspectives on ChatGPT Usage
Survey Setup
Survey Questions
Survey Participants
Reasons for using ChatGPT (RQ1)
Concerns about ChatGPT Responses (RQ2)
Verification of ChatGPT Responses (RQ3)
CID: An Automatic ChatGPT Incorrectness Detector
ENQUIRER
CHALLENGER
Basic Challenger
Mutation Challenger
DECIDER
Dataset Creation
...and 17 more sections

Figures (3)

Figure 1: Overview of CID Tool.
Figure 2: Metamorphic Relations (MRs) used in the Mutation Challenger to mutate questions.
Figure 3: Sources and categories of misclassification.
