Do Large Language Models Know about Facts?

Xuming Hu, Junzhe Chen, Xiaochuan Li, Yufei Guo, Lijie Wen, Philip S. Yu, Zhijiang Guo

TL;DR

This paper examines whether large language models truly memorize and reason over factual knowledge. It introduces Pinocchio, a benchmark of 20,713 multiple-choice questions across seven tasks that probe multifaceted, structured, adversarial, temporal, real-world, domain-specific, and multilingual facts. Through extensive experiments on 10 accessible LLMs using zero-shot, few-shot, and reasoning-enhanced prompts, the authors show that even instruction-tuned and RLHF models lag in factual accuracy and rely on spurious correlations. The authors analyze multi-hop reasoning, structured data handling, temporal updates, adversarial robustness, and multilingual transfer, highlighting key gaps and directions for future work. Pinocchio, along with the accompanying code, is intended to support the development of LLMs whose factual knowledge is more trustworthy and up to date.

Abstract

Large language models (LLMs) have recently driven striking performance improvements across a range of natural language processing tasks. The factual knowledge acquired during pretraining and instruction tuning can be useful in various downstream tasks, such as question answering and language generation. Unlike conventional Knowledge Bases (KBs) that explicitly store factual knowledge, LLMs implicitly store facts in their parameters. Content generated by LLMs can often exhibit inaccuracies or deviations from the truth, because facts can be incorrectly induced or become obsolete over time. To this end, we aim to comprehensively evaluate the extent and scope of factual knowledge within LLMs by designing the benchmark Pinocchio. Pinocchio contains 20K diverse factual questions that span different sources, timelines, domains, regions, and languages. Furthermore, we investigate whether LLMs are able to compose multiple facts, update factual knowledge temporally, reason over multiple facts, identify subtle factual differences, and resist adversarial examples. Extensive experiments on different sizes and types of LLMs show that existing LLMs still lack factual knowledge and suffer from various spurious correlations. We believe this is a critical bottleneck for realizing trustworthy artificial intelligence. The Pinocchio dataset and our code will be publicly available.
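
To make the evaluation setup concrete, the following is a minimal sketch of how a multiple-choice factual benchmark could be scored under zero-shot and few-shot prompting. It is not the authors' code: the `Question` class, the prompt layout, the letter-based answer parsing, and the `model` callable are all hypothetical stand-ins chosen for illustration, and the two sample questions are toy items rather than entries from Pinocchio.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence

LETTERS = "ABCDEFGH"  # option labels; assumes at most 8 options per question


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one multiple-choice item."""
    text: str
    options: tuple[str, ...]
    answer: int  # index of the correct option


def format_question(q: Question, with_answer: bool = False) -> str:
    """Render a question; include the gold letter only for demonstrations."""
    lines = [f"Question: {q.text}"]
    lines += [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(q.options)]
    lines.append("Answer:" + (f" {LETTERS[q.answer]}" if with_answer else ""))
    return "\n".join(lines)


def build_prompt(q: Question, demos: Sequence[Question] = ()) -> str:
    """Zero-shot when `demos` is empty, few-shot otherwise."""
    parts = [format_question(d, with_answer=True) for d in demos]
    parts.append(format_question(q))
    return "\n\n".join(parts)


def accuracy(
    questions: Sequence[Question],
    model: Callable[[str], str],  # hypothetical: maps a prompt to a text reply
    demos: Sequence[Question] = (),
) -> float:
    """Fraction of questions whose first reply character is the gold letter."""
    correct = 0
    for q in questions:
        reply = model(build_prompt(q, demos)).strip().upper()
        # An empty or unparseable reply counts as wrong.
        predicted = LETTERS.find(reply[0]) if reply else -1
        correct += predicted == q.answer
    return correct / len(questions)


if __name__ == "__main__":
    toy = [
        Question("Water is composed of hydrogen and oxygen.", ("True", "False"), 0),
        Question("The Pacific is the smallest ocean.", ("True", "False"), 1),
    ]
    always_a = lambda prompt: "A"  # stand-in for a real LLM call
    print(build_prompt(toy[1], demos=toy[:1]))
    print(accuracy(toy, always_a))
```

Passing demonstrations through `demos` switches the same scoring loop from zero-shot to few-shot, so the prompting settings can be compared on identical questions. A reasoning-enhanced setting would need a different answer parser, since the reply would no longer begin with the option letter.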

Paper Structure (30 sections, 8 figures, 8 tables)

This paper contains 30 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Pinocchio is a comprehensive dataset that tackles 7 distinct tasks related to factual knowledge and reasoning. It consists of 20,713 multiple-choice questions that have been sourced from various reliable and diverse channels.
  • Figure 2: Illustration of prompts using different settings.
  • Figure 3: GPT-3.5-Turbo's outcomes across three distinct tasks under Few-shot CoT setting.
  • Figure 4: Results of GPT-3.5-Turbo in three different tasks under Few-shot CoT setting.
  • Figure 5: Prompts of four different settings.
  • ...and 3 more figures