GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence
Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Zachary C. Lipton, Jeffrey P. Bigham
TL;DR
The paper tackles factual errors in LLM outputs for document-grounded tasks by introducing GenAudit, an interactive fact-checking tool that locates unsupported claims, proposes minimal edits, and displays supporting evidence from reference documents. Its backend fact-checking models are trained on the USB dataset using a sequence-to-sequence formulation, with memory-efficient techniques such as 4-bit quantization with low-rank adapters and iterative document reduction. Across eight models and three domains, GenAudit achieves high evidence precision and strong evidence recall, and human studies show that people detect errors substantially better when assisted by the tool. A thresholded decoding strategy is proposed to boost recall with a controlled trade-off in precision, and the tool and models are released publicly for broader use and evaluation.
Abstract
LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit, a tool intended to assist fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and also presents evidence from the reference for facts that do appear to have support. We train models to execute these tasks, and design an interactive interface to present suggested edits and evidence to users. Comprehensive evaluation by human raters shows that GenAudit can detect errors in the outputs of 8 different LLMs when summarizing documents from diverse domains. User studies demonstrate that using GenAudit can substantially improve the performance of humans at finding errors in LLM-generated summaries. We release our tool (GenAudit) and fact-checking model for public use.
