ChatGPT Inaccuracy Mitigation during Technical Report Understanding: Are We There Yet?

Salma Begum Tamanna, Gias Uddin, Song Wang, Lan Xia, Longyu Zhang

TL;DR

CHIME (ChatGPT Inaccuracy Mitigation Engine) rests on the principle that preprocessing technical reports better and guiding the query validation process in ChatGPT can address the observed limitations of ChatGPT in understanding technical reports.

Abstract

Hallucinations, the tendency to produce irrelevant or incorrect responses, are a prevalent concern in generative AI-based tools like ChatGPT. Although hallucinations in ChatGPT have been studied for textual responses, it is unknown how ChatGPT hallucinates for technical texts that contain both textual and technical terms. We surveyed 47 software engineers and produced a benchmark of 412 Q&A pairs from the bug reports of two OSS projects. We find that a RAG-based ChatGPT (i.e., ChatGPT tuned with the benchmark issue reports) is 36.4% correct when producing answers to the questions, for two reasons: (1) limitations in understanding complex technical content in code snippets like stack traces, and (2) limitations in integrating the contexts denoted in the technical terms and texts. We present CHIME (ChatGPT Inaccuracy Mitigation Engine), whose underlying principle is that if we can preprocess the technical reports better and guide the query validation process in ChatGPT, we can address the observed limitations. CHIME uses a context-free grammar (CFG) to parse stack traces in technical reports. CHIME then verifies and fixes ChatGPT responses by applying metamorphic testing and query transformation. In our benchmark, CHIME shows 30.3% more correction over ChatGPT responses. In a user study, we find that the improved responses with CHIME are considered more useful than those generated from ChatGPT without CHIME.
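
The abstract does not reproduce CHIME's grammar, so the following is only a minimal sketch of what CFG-based stack trace parsing could look like. It assumes Java-style stack traces and a hypothetical four-rule grammar; the names `Trace`, `Frame`, `parse_trace`, and `parse`, as well as the sample trace, are illustrative and not taken from the paper. The recursive `cause` rule is what makes the grammar context-free rather than a flat line-by-line pattern.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical grammar (an assumption, not the grammar used by CHIME):
#   trace  -> header frame+ cause?
#   cause  -> "Caused by:" trace
#   header -> CLASS (":" MESSAGE)?
#   frame  -> "at" METHOD "(" LOCATION ")"
HEADER = re.compile(r"^(?P<cls>[\w.$]+)(?::\s*(?P<msg>.*))?$")
FRAME = re.compile(r"^at\s+(?P<method>[\w.$<>]+)\((?P<loc>[^)]*)\)$")
CAUSE_PREFIX = "Caused by:"


@dataclass
class Frame:
    method: str
    location: str


@dataclass
class Trace:
    exception: str
    message: Optional[str]
    frames: List[Frame] = field(default_factory=list)
    cause: Optional["Trace"] = None


def parse_trace(lines: List[str], pos: int = 0) -> Tuple[Trace, int]:
    """Parse one `trace` starting at lines[pos]; return the node and the next position."""
    header = HEADER.match(lines[pos])
    if header is None:
        raise ValueError(f"expected an exception header at line {pos}")
    trace = Trace(header["cls"], header["msg"])
    pos += 1
    # frame+ : consume consecutive "at ..." lines
    while pos < len(lines) and (m := FRAME.match(lines[pos])):
        trace.frames.append(Frame(m["method"], m["loc"]))
        pos += 1
    if not trace.frames:
        raise ValueError("expected at least one stack frame")
    # cause? : strip the prefix and recurse on the nested trace
    if pos < len(lines) and lines[pos].startswith(CAUSE_PREFIX):
        nested_header = lines[pos][len(CAUSE_PREFIX):].strip()
        lines = lines[:pos] + [nested_header] + lines[pos + 1:]
        trace.cause, pos = parse_trace(lines, pos)
    return trace, pos


def parse(text: str) -> Trace:
    """Parse a stack trace string; trailing lines the grammar does not cover are ignored."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    trace, _ = parse_trace(lines)
    return trace


# Illustrative input, not taken from the benchmark.
SAMPLE = """
java.lang.RuntimeException: failed to load config
    at com.example.Loader.load(Loader.java:42)
    at com.example.Main.main(Main.java:7)
Caused by: java.io.FileNotFoundException: app.yml
    at java.io.FileInputStream.open0(Native Method)
"""

trace = parse(SAMPLE)
```

Parsing into a tree like this turns the exception type, the frames, and the chain of causes into separate fields that can be handed to a model as structured context instead of raw text. The sketch does not handle elided-frame lines such as `... 2 more`; a fuller grammar would need a rule for them.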

Paper Structure

This paper contains 32 sections, 1 equation, 9 figures, 9 tables.

Figures (9)

  • Figure 1: (a) Frequency & (b) Reasons of Bug Report Exploration
  • Figure 2: (a) Interest for T1 (b) Usefulness Perception of T1
  • Figure 3: (a) Interest for T2 (b) Usefulness Perception of T2
  • Figure 4: (a) Interest for T3 (b) Usefulness Perception of T3
  • Figure 5: (a) Interest for T4 (b) Usefulness Perception of T4
  • ...and 4 more figures