UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions

Chuanyuan Tan, Wenbiao Shao, Hao Xiong, Tong Zhu, Zhenhua Liu, Kai Shi, Wenliang Chen

TL;DR

UAQFact introduces a bilingual, KG-backed unanswerable-question dataset designed to evaluate how LLMs leverage internal and external factual knowledge. It pairs unanswerable questions (UAQs) and answerable questions (ABQs) with auxiliary Wikidata-derived knowledge and reasoning clues, enabling three tasks that separately test discrimination, internal knowledge utilization, and external knowledge integration. Across multiple LLM families and both languages, results show persistent gaps in using stored knowledge effectively; external knowledge yields gains, but models do not fully exploit it. The work highlights the need for better mechanisms to activate internal facts and to integrate external knowledge when handling UAQs, and it provides a scalable framework for future multilingual, knowledge-aware evaluation.

Abstract

Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs' performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs' ability to utilize their factual knowledge when handling UAQ. To address this limitation, we introduce UAQFact, a new bilingual unanswerable-question dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs' ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have the relevant factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of that knowledge, which may result in incorrect responses.

Paper Structure

This paper contains 43 sections, 1 equation, 3 figures, 17 tables.

Figures (3)

  • Figure 1: Dataset Construction Process (QType Inter in English as an example) for unanswerable questions (UAQ) and answerable questions (ABQ): (1) Define the question type. (2) Sample factual triples from Wikidata as knowledge. (3) Generate questions by filling in templates generated by an LLM. (4) Define three tasks and compose unique inputs, using the factual knowledge from the preceding steps as references.
  • Figure 2: Refusal rate and accuracy (Acc) on Task 1 for the Qwen2.5 series, with model sizes ranging from 0.5B to 72B parameters. Detailed results are shown in Appendix \ref{sec_app_task1_scaling}.
  • Figure 3: R$_{\Delta}$ Comparison Between Base and EKnow. EN and ZH are abbreviations for English and Chinese, respectively. Detailed results are shown in Appendix \ref{sec_app_task3}.