GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence

Kundan Krishna; Sanjana Ramprasad; Prakhar Gupta; Byron C. Wallace; Zachary C. Lipton; Jeffrey P. Bigham

TL;DR

The paper tackles factual errors in LLM outputs for document-grounded tasks by introducing GenAudit, an interactive fact-checking tool that locates unsupported claims, proposes minimal edits, and displays supporting evidence from reference documents. It trains backend fact-checking models on the USB dataset using a sequence-to-sequence formulation and employs memory-efficient techniques such as 4-bit quantization with low-rank adapters and iterative document reduction. Across eight models and three domains, GenAudit achieves high evidence-precision and strong evidence recall, while human studies show substantial improvements in error-detection performance when assisted by the tool. A thresholded decoding strategy is proposed to boost recall with a controlled precision trade-off, and the tool along with the models is released publicly for broader use and evaluation.
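The recall/precision trade-off behind thresholded flagging can be illustrated with a toy sketch. This is not the paper's algorithm: the per-token "keep" probabilities, the gold error labels, and the function names `flag_tokens` and `precision_recall` are all hypothetical, chosen only to show how raising a threshold flags more tokens, which tends to raise recall at some cost in precision.

```python
from typing import List, Tuple


def flag_tokens(keep_probs: List[float], threshold: float) -> List[bool]:
    """Flag a token as a suspected error when its (hypothetical) keep
    probability falls below the threshold."""
    return [p < threshold for p in keep_probs]


def precision_recall(flags: List[bool], gold: List[bool]) -> Tuple[float, float]:
    """Compute precision and recall of flagged tokens against gold error labels."""
    true_pos = sum(f and g for f, g in zip(flags, gold))
    n_flagged = sum(flags)
    n_errors = sum(gold)
    precision = true_pos / n_flagged if n_flagged else 1.0
    recall = true_pos / n_errors if n_errors else 1.0
    return precision, recall


# Made-up values for illustration only (not taken from the paper).
keep_probs = [0.95, 0.40, 0.70, 0.99, 0.75]
gold_errors = [False, True, True, False, False]

for threshold in (0.5, 0.8):
    flags = flag_tokens(keep_probs, threshold)
    precision, recall = precision_recall(flags, gold_errors)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy data, the lower threshold flags only the most doubtful token, while the higher threshold catches both true errors and also flags one correct token.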

Abstract

LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit -- a tool intended to assist in fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in the outputs of 8 different LLMs when summarizing documents from diverse domains. User studies demonstrate that using GenAudit can substantially improve the performance of humans at finding errors in LLM-generated summaries. We release our tool (GenAudit) and fact-checking model for public use.

Paper Structure (19 sections, 7 equations, 9 figures, 9 tables)

Introduction
Background
The USB dataset
Reducing memory requirement for training
System Overview
Experiments
Human Evaluation
Impact on human performance at fact-checking
Improving Recall of Error Detection
Related Work
Conclusion
Appendix
Binary Classification of Factuality
Details on human evaluation (for the Human Evaluation section)
Annotator Instructions (for the Human Evaluation section)
...and 4 more sections

Figures (9)

Figure 1: An illustration of GenAudit's user interface and sample predictions. The reference document (a clinical transcript) is on the left and the generated text to be fact-checked is on the right (generated by querying any LLM, but manually entered here for ease of illustration). Spans in the text which are not supported or are contradicted by the reference are highlighted in red, with suggested replacements in green. As the user moves to any line in the generated text, evidence found for all facts in it is highlighted using blue links. Evidence and error predictions shown here are made by a fine-tuned Flan-UL2 model backend.
Figure 2: Overview of the GenAudit interface. The reference document (A) is displayed alongside the model-generated output (B), allowing for easy comparison. Users can initiate the fact-checking process by clicking the fact-checking icon next to each output sentence (C), which highlights potential factual inconsistencies. Evidence markers (D) indicate supporting or contradicting evidence from the reference document. Hovering over an evidence marker highlights the relevant section in the reference (E), and clicking on it scrolls to the corresponding location in the document. Hallucinated spans in the generated output are marked in red (F), with the system providing suggestions for real-time correction.
Figure 3: Variation in precision and recall of error identification by a fine-tuned Flan-UL2 model when using the thresholded editing algorithm, versus editing out additional tokens either at random or by selecting the ones with low probability.
Figure 4: Prompt template used for introducing factual errors in summaries used for human fact-checking performance evaluation (see the section on impact on human performance at fact-checking). Taken from Laban et al. (2023).
Figure 5: Interface used for collecting feedback on suggested evidence and edits from GenAudit. Annotators can accept/reject each suggested evidence sentence, and can also mark additional sentences as evidence if needed. Suggested edits can be accepted/rejected by clicking on the button on the top-right of the highlighted span. If needed, users can also make freeform edits to fix more errors. Annotators cycle through the summaries generated by different models, whose names are anonymized and whose order is shuffled.
...and 4 more figures
