SourceBench: Can AI Answers Reference Quality Web Sources?

Hexi Jin; Stephen Liu; Yuheng Li; Simran Malik; Yiying Zhang

TL;DR

This work introduces SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. It reveals four key new insights that can guide future research on GenAI and web search.

Abstract

Large language models (LLMs) increasingly answer queries by citing web sources, but existing evaluations emphasize answer correctness rather than evidence quality. We introduce SourceBench, a benchmark for measuring the quality of cited web sources across 100 real-world queries spanning informational, factual, argumentative, social, and shopping intents. SourceBench uses an eight-metric framework covering content quality (content relevance, factual accuracy, objectivity) and page-level signals (e.g., freshness, authority/accountability, clarity), and includes a human-labeled dataset with a calibrated LLM-based evaluator that closely matches expert judgments. Using SourceBench, we evaluate eight LLMs, Google Search, and three AI search tools over 3,996 cited sources, and conduct further experiments to understand the evaluation results. Overall, our work reveals four key new insights that can guide future research on GenAI and web search.

Paper Structure (26 sections, 2 figures, 9 tables)

Introduction
Motivation and Related Work
Search and GenAI
Related Benchmarks and Evaluation
SourceBench
Request Collection
Multi-Facet Source Quality Metrics
Evaluator
Source Collection
Human Labeling
LLM Evaluator
Evaluation Results
Evaluating Systems
Results
Correlation of Metrics
...and 11 more sections

Figures (2)

Figure 1: Correlation between Metrics. Heatmap of correlation between different metrics; results from all evaluating systems are aggregated.
Figure 2: Correlation in HotpotQA.
