ComparisonQA: Evaluating Factuality Robustness of LLMs Through Knowledge Frequency Control and Uncertainty

Qing Zong, Zhaowei Wang, Tianshi Zheng, Xiyu Ren, Yangqiu Song

TL;DR

This work introduces ComparisonQA, a frequency-controlled benchmark that pairs high- and low-frequency entities under a shared abstract question to enable fair evaluation of LLM factual knowledge. It combines a large-scale automatic generation pipeline with a two-round robustness assessment that jointly analyzes correctness and model uncertainty to distinguish true knowledge from shortcuts. A notable finding is that even strong models like GPT-4o show poor robustness on low-frequency knowledge, while uncertainty-based filtering proves effective in curating a high-quality hard subset, ComparisonQA-Hard. The dataset and methodology offer a rigorous framework for probing knowledge frequency effects and guiding future improvements in factual accuracy under tail knowledge conditions.

Abstract

The rapid development of LLMs has sparked extensive research into their factual knowledge. Existing work finds that LLMs fall short on questions about low-frequency entities. However, such evidence is unreliable, since the questions can differ not only in entity frequency but also in intrinsic difficulty. We therefore introduce the ComparisonQA benchmark, containing 283K abstract questions, each instantiated by a pair of high-frequency and low-frequency entities. Because the two questions in a pair differ only in the entity, and hence in its frequency, the benchmark enables a controlled comparison of how knowledge frequency affects LLM performance. In addition, we use both correctness and uncertainty to develop a two-round method for evaluating LLMs' knowledge robustness. The method aims to avoid possible semantic shortcuts, which are a serious problem in current QA research. Experiments reveal that LLMs, including GPT-4o, exhibit particularly low robustness on low-frequency knowledge. We also find that uncertainty can be used to effectively identify high-quality, shortcut-free questions while maintaining the data size. Based on this, we propose an automatic method for selecting such questions to form a subset called ComparisonQA-Hard, which contains only hard low-frequency questions.
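To make the paired design concrete, the following is a minimal sketch of how one abstract question might be instantiated with its two entities and how accuracy could then be compared across the pair. It is not the authors' implementation: the `ComparisonItem` class, the `[ENTITY]` placeholder, and the `paired_accuracy` helper are all hypothetical names introduced here for illustration, and the outcomes in the usage lines are toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

PLACEHOLDER = "[ENTITY]"  # hypothetical marker; the real template format is not given here


@dataclass(frozen=True)
class ComparisonItem:
    """One abstract question shared by a high- and a low-frequency entity."""
    abstract_question: str
    high_freq_entity: str
    low_freq_entity: str

    def instantiate(self) -> Tuple[str, str]:
        """Return the (high-frequency, low-frequency) concrete questions."""
        return (
            self.abstract_question.replace(PLACEHOLDER, self.high_freq_entity),
            self.abstract_question.replace(PLACEHOLDER, self.low_freq_entity),
        )


def paired_accuracy(outcomes: Iterable[Tuple[bool, bool]]) -> Tuple[float, float, float]:
    """Compute (high accuracy, low accuracy, gap) from per-pair correctness.

    Each element is (correct on high-frequency, correct on low-frequency)
    for the same abstract question, so the gap isolates the entity change.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("need at least one pair")
    n = len(outcomes)
    high = sum(1 for h, _ in outcomes if h) / n
    low = sum(1 for _, l in outcomes if l) / n
    return high, low, high - low


# Toy usage with made-up entities and outcomes (not from the paper).
item = ComparisonItem("In which year was [ENTITY] founded?", "Entity A", "Entity B")
high_q, low_q = item.instantiate()
high_acc, low_acc, gap = paired_accuracy([(True, False), (True, True), (False, False), (True, False)])
```

Because both questions in a pair come from the same template, any accuracy gap in this sketch is attributable to the entity substitution rather than to differences in question wording.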

Paper Structure (28 sections, 3 figures, 10 tables)

Figures (3)

  • Figure 1: An example from ComparisonQA
  • Figure 2: An overview of our benchmark curation pipeline, which has three parts. The first two parts, (1) Entity Pairs Extraction from DBpedia and (2) Abstract Question Generation, produce the full ComparisonQA. The third part, (3) Hard High-Quality Question Selecting, yields a harder subset containing only difficult, high-quality low-frequency questions with no semantic shortcuts.
  • Figure 3: Heatmaps illustrating how subset quality changes with the incorrect model number and the high-uncertainty remaining ratio. The former is the minimum number of times each remaining question in the subset is answered incorrectly. The latter is the proportion of high-uncertainty questions retained in the subset.
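
The two parameters in the Figure 3 caption can be read as a simple two-stage filter. The sketch below is one possible reading, not the paper's selection procedure. It assumes that "answered incorrectly" is counted per evaluated model, that uncertainty is a single aggregated score per question with higher meaning less certain, and that the remaining ratio keeps the top fraction of questions ranked by that score. `QuestionStats` and `select_hard_subset` are hypothetical names, and the values in the usage lines are toy data.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class QuestionStats:
    qid: str
    incorrect_models: int  # assumed: number of evaluated models that answered wrongly
    uncertainty: float     # assumed: aggregated uncertainty score, higher = less certain


def select_hard_subset(
    stats: Sequence[QuestionStats],
    min_incorrect: int,
    keep_ratio: float,
) -> List[QuestionStats]:
    """Keep the most uncertain questions, then require enough incorrect answers.

    keep_ratio is the fraction of questions retained after ranking by
    uncertainty; min_incorrect is the minimum incorrect-answer count that
    every question in the returned subset must reach.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    ranked = sorted(stats, key=lambda s: s.uncertainty, reverse=True)
    keep_n = int(len(ranked) * keep_ratio)  # rounds down
    return [s for s in ranked[:keep_n] if s.incorrect_models >= min_incorrect]


# Toy usage with made-up statistics (not from the paper).
pool = [
    QuestionStats("a", incorrect_models=3, uncertainty=0.9),
    QuestionStats("b", incorrect_models=1, uncertainty=0.8),
    QuestionStats("c", incorrect_models=3, uncertainty=0.2),
    QuestionStats("d", incorrect_models=2, uncertainty=0.7),
]
subset = select_hard_subset(pool, min_incorrect=2, keep_ratio=0.75)
```

Sweeping `min_incorrect` and `keep_ratio` over a grid and scoring each resulting subset would give the kind of two-axis view the heatmaps describe; how subset quality is scored is not specified here and is left out of the sketch.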