AuthorityBench: Benchmarking LLM Authority Perception for Reliable Retrieval-Augmented Generation

Zhihui Yao, Hengran Zhang, Keping Bi

Abstract

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) with external knowledge but remains vulnerable to low-authority sources that can propagate misinformation. We investigate whether LLMs can perceive information authority, a capability that extends beyond semantic understanding. To this end, we introduce AuthorityBench, a comprehensive benchmark for evaluating LLM authority perception comprising three datasets: DomainAuth (10K web domains with PageRank-based authority), EntityAuth (22K entities with popularity-based authority), and RAGAuth (120 queries with documents of varying authority for downstream evaluation). We evaluate five LLMs using three judging methods (PointJudge, PairJudge, ListJudge) across multiple output formats. Results show that ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers the best cost-effectiveness. Notably, incorporating webpage text consistently degrades judgment performance, suggesting that authority is distinct from textual style. Downstream RAG experiments show that authority-guided filtering substantially improves answer accuracy, validating the practical importance of authority perception for reliable knowledge retrieval. Code and benchmark are available at: https://github.com/Trustworthy-Information-Access/AuthorityBench.
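
To make the three judging setups named in the abstract concrete, the sketch below shows one plausible way to phrase each of them as a prompt. It is only an illustration: the function names, the 1-10 scale, and the wording are assumptions for exposition, not the benchmark's exact prompts or output formats.

    # Illustrative prompt builders for the three judging methods named in the
    # abstract (PointJudge, PairJudge, ListJudge). Wording, scale, and output
    # format are assumptions for exposition, not AuthorityBench's prompts.

    def point_judge_prompt(domain: str) -> str:
        """PointJudge: score a single source's authority in isolation."""
        return (
            f"Rate the authority of the web domain '{domain}' on a scale from 1 "
            "(least authoritative) to 10 (most authoritative). Output only the number."
        )

    def pair_judge_prompt(domain_a: str, domain_b: str) -> str:
        """PairJudge: compare two sources and pick the more authoritative one."""
        return (
            f"Which web domain is more authoritative: (A) '{domain_a}' or "
            f"(B) '{domain_b}'? Output only the letter A or B."
        )

    def list_judge_prompt(domains: list[str]) -> str:
        """ListJudge: rank a list of sources by authority in a single call."""
        listing = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(domains))
        return (
            "Rank the following web domains from most to least authoritative. "
            "Output the ranking as a comma-separated list of item numbers.\n" + listing
        )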

Paper Structure

This paper contains 27 sections, 1 equation, 6 figures, and 12 tables.

Figures (6)

  • Figure 1: An illustration of the authority perception challenge in RAG. When faced with conflicting information from sources of varying authority (e.g., a high-authority medical institution like Mayo Clinic vs. lower-authority lifestyle blogs), an LLM must correctly discern which source to trust to provide a reliable answer. Our work investigates this capability.
  • Figure 2: An overview of AuthorityBench. It consists of three sub-tasks: DomainAuth for source authority, EntityAuth for entity authority, and RAGAuth for downstream RAG evaluation. The outer rings show the topic distribution within each dataset.
  • Figure 3: Authority score distribution in DomainAuth and EntityAuth.
  • Figure 4: Authority-aware RAG pipeline: retrieved documents are LLM-scored, and the top-k go to the generator. Filtering signals: (a) w/o Filter; (b) Relevance Filter; (c) Utility Filter; (d) Authority Filter (source URL only; no document content). All prompts in this section are provided in the paper's appendix. (A code sketch of the Authority Filter follows this figure list.)
  • Figure 5: An example from the RAGAuth dataset. Each instance includes a yes/no question, the ground-truth answer, and a list of 10 retrieved documents with their source URL, domain, PageRank score, and text snippet. The task is to generate a correct answer based on this information.
  • ...and 1 more figure
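
The Authority Filter variant described under Figure 4 scores each retrieved document from its source URL alone and forwards only the top-k documents to the generator. The sketch below illustrates that step under stated assumptions: the generic LLM client with a complete(prompt) -> str method, the 0-10 scoring prompt, and the helper names score_source_authority and authority_filter are all hypothetical, not the paper's implementation.

    # Minimal sketch of the Authority Filter from Figure 4: the LLM scores each
    # source URL (no document content), and only the top-k documents are passed
    # to the generator. Assumes a client exposing complete(prompt) -> str;
    # the prompt wording and 0-10 scale are illustrative assumptions.

    def score_source_authority(client, url: str) -> float:
        """Ask the LLM for a numeric authority score of a source URL."""
        prompt = (
            "On a scale of 0 to 10, how authoritative is the following web source? "
            "Answer with a single number only.\n"
            f"Source URL: {url}"
        )
        reply = client.complete(prompt)  # hypothetical client call
        try:
            return float(reply.strip())
        except ValueError:
            return 0.0  # treat unparsable replies as lowest authority

    def authority_filter(client, retrieved_docs: list[dict], k: int = 3) -> list[dict]:
        """Keep the top-k retrieved documents by LLM-judged source authority."""
        scored = [(score_source_authority(client, doc["url"]), doc) for doc in retrieved_docs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[:k]]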