FactGuard: Leveraging Multi-Agent Systems to Generate Answerable and Unanswerable Questions for Enhanced Long-Context LLM Extraction

Qian-Wen Zhang, Fang Li, Jie Wang, Lingfeng Qiao, Yifei Yu, Di Yin, Xing Sun

TL;DR

FactGuard tackles the challenge of robustly answering and recognizing unanswerable questions in long-context reading comprehension by introducing a multi-agent data augmentation framework that autonomously generates evidence-grounded QA pairs and unanswerable cases. The authors build FactGuard-Bench, a long-context benchmark with 25,220 examples spanning 8K to 128K tokens across English and Chinese texts in law and books, and demonstrate substantial gaps in current LLMs’ handling of unanswerable queries. Through a three-stage pipeline (preparation, QA generation, negative example generation) and fine-tuning with SFT and DPO, the approach yields substantial improvements in answer-consistency and reasoning about unanswerable questions, achieving up to 82.39% overall accuracy on an 8B model. The work provides a scalable framework and benchmark for advancing long-context QA, highlighting the need for explicit unanswerable reasoning to curb hallucinations and improve information integrity in LLMs.

Abstract

Extractive reading comprehension systems are designed to locate the correct answer to a question within a given text. However, a persistent challenge lies in ensuring these models maintain high accuracy in answering questions while reliably recognizing unanswerable queries. Despite significant advances in large language models (LLMs) for reading comprehension, this issue remains critical, particularly as the length of supported contexts continues to expand. To address this challenge, we propose an innovative data augmentation methodology grounded in a multi-agent collaborative framework. Unlike traditional methods, such as the costly human annotation process required for datasets like SQuAD 2.0, our method autonomously generates evidence-based question-answer pairs and systematically constructs unanswerable questions. Using this methodology, we developed the FactGuard-Bench dataset, which comprises 25,220 examples of both answerable and unanswerable question scenarios, with context lengths ranging from 8K to 128K. Experimental evaluations conducted on seven popular LLMs reveal that even the most advanced models achieve only 61.79% overall accuracy. Furthermore, we emphasize the importance of a model's ability to reason about unanswerable questions to avoid generating plausible but incorrect answers. By implementing efficient data selection and generation within the multi-agent collaborative framework, our method significantly reduces the traditionally high costs associated with manual annotation and provides valuable insights for the training and optimization of LLMs.

Paper Structure

This paper contains 28 sections, 4 figures, 8 tables, 1 algorithm.

Figures (4)

  • Figure 1: Illustration of FactGuard for data synthesis in a multi-agent collaboration framework.
  • Figure 2: Percentage of attrition in FactGuard's data processing program during misleading negative example generation.
  • Figure 3: Distributions of FactGuard-Bench in terms of domain, question type and length.
  • Figure 4: Case study. An example of an answerable question in English is on the left, an example of lack of evidence in English is in the center, and an example of misleading evidence in Chinese is on the right (translated below). Red underlined text indicates hallucinatory content and green italicized text indicates useful explanations.