WebCiteS: Attributed Query-Focused Summarization on Chinese Web Search Results with Citations

Haolin Deng; Chang Wang; Xin Li; Dezhang Yuan; Junlang Zhan; Tianhua Zhou; Jin Ma; Jun Gao; Ruifeng Xu

TL;DR

This work introduces attributed query-focused summarization (AQFS) and WebCiteS, a Chinese dataset with ~7k human-annotated summaries and inline citations derived from real user queries and web results. It distinguishes groundedness from citation quality via a new evaluation framework built on a claim-split model and NLI-based verification, enabling fine-grained assessment of partial support and citation accuracy. An automatic evaluator is developed and validated against human annotations, and extensive experiments with open-source and proprietary LLMs reveal persistent attribution challenges, especially in long-context and fine-grained document settings. The study demonstrates that supervised fine-tuning improves attribution but also highlights the need for more precise evidence localization and robust citation mechanisms to ensure reliable source-grounded generations in practical retrieval-augmented systems.

Abstract

Enhancing the attribution in large language models (LLMs) is a crucial task. One feasible approach is to enable LLMs to cite external sources that support their generations. However, existing datasets and evaluation methods in this domain still exhibit notable limitations. In this work, we formulate the task of attributed query-focused summarization (AQFS) and present WebCiteS, a Chinese dataset featuring 7k human-annotated summaries with citations. WebCiteS derives from real-world user queries and web search results, offering a valuable resource for model training and evaluation. Prior works in attribution evaluation do not differentiate between groundedness errors and citation errors. They also fall short in automatically verifying sentences that draw partial support from multiple sources. We tackle these issues by developing detailed metrics and enabling the automatic evaluator to decompose the sentences into sub-claims for fine-grained verification. Our comprehensive evaluation of both open-source and proprietary models on WebCiteS highlights the challenge LLMs face in correctly citing sources, underscoring the necessity for further improvement. The dataset and code will be open-sourced to facilitate further research in this crucial field.
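To make the distinction between groundedness errors and citation errors concrete, the following is a minimal sketch of sub-claim-level verification. It is an illustration under stated assumptions, not the authors' evaluator: `toy_entails` is a hypothetical substring check standing in for an NLI model, `verify_sentence` is a hypothetical helper name, the sub-claims are supplied by hand rather than produced by a claim-split model, and the precision and recall definitions are simplified assumptions rather than the paper's exact metrics.

```python
from typing import Callable, Dict, List, Set


def toy_entails(premise: str, claim: str) -> bool:
    """Hypothetical stand-in for an NLI model: a case-insensitive substring check."""
    return claim.lower() in premise.lower()


def verify_sentence(
    sub_claims: List[str],
    docs: Dict[int, str],
    cited: Set[int],
    entails: Callable[[str, str], bool] = toy_entails,
) -> Dict[str, object]:
    """Score one summary sentence from its sub-claims (simplified, assumed metrics)."""
    # For each sub-claim, collect the ids of documents that support it.
    support = [
        {doc_id for doc_id, text in docs.items() if entails(text, claim)}
        for claim in sub_claims
    ]
    # Grounded: every sub-claim is supported by at least one document, cited or not.
    grounded = bool(support) and all(support)
    supporting_docs = set().union(*support)
    # Citation precision: share of cited documents that support some sub-claim.
    precision = len(cited & supporting_docs) / len(cited) if cited else 0.0
    # Citation recall: share of sub-claims supported by at least one cited document.
    covered = sum(1 for doc_ids in support if doc_ids & cited)
    recall = covered / len(sub_claims) if sub_claims else 0.0
    return {"grounded": grounded, "precision": precision, "recall": recall}


# Made-up example data: one sentence that draws on two documents but cites only one.
docs = {
    1: "Green tea contains caffeine and is widely consumed.",
    2: "Drinking tea late at night may affect sleep.",
}
sub_claims = [
    "green tea contains caffeine",
    "drinking tea late at night may affect sleep",
]
result = verify_sentence(sub_claims, docs, cited={1})
print(result)
```

In this made-up example the sentence is grounded, because each sub-claim is supported by some document, yet the citation is incomplete, because the second sub-claim is supported only by the uncited document 2. A sentence-level check against the cited document alone would not separate these two cases, which is why the sketch verifies each sub-claim against every document.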

Paper Structure (73 sections, 4 equations, 7 figures, 13 tables)

Introduction
Firstly, most existing datasets are deficient in high-quality citation annotations.
Secondly, current evaluation methods are insufficient to thoroughly assess attribution.
The WebCiteS Dataset
Task Formulation of AQFS
Data Collection
Human-LLM Collaborative Annotation
Stage 1: Manual Screening and Information Extraction.
Stage 2: LLM-based Candidate Summary Generation.
Stage 3: Manual Refinement and Citation Annotation.
Quality control.
Are the retrieved documents useful to the queries?
How much manual refinement is made on candidate summaries?
Overlap of web pages.
Evaluation Framework
...and 58 more sections

Figures (7)

Figure 1: Illustration of attributed query-focused summarization (AQFS). Full example is shown in Table \ref{tab:fsp_prompt}.
Figure 2: Illustration of the human-LLM collaborative annotation pipeline of WebCiteS. Initially, annotators manually extract useful information from the documents; then, LLMs are used to generate candidate summaries from the extraction; finally, annotators choose the preferred candidate, refine its quality, and annotate citations.
Figure 3: Illustration of our attribution evaluation. We use a claim-split model to extract sub-claims of a sentence and conduct fine-grained verification on all the source documents. The translation is in italic text.
Figure 4: Performance change over context length for the models in Table \ref{tab:fulldoc}, where the full content of web pages is chunked into documents with a maximum length of 512. Model names are followed by their context window size. The number of input tokens is counted using each model's own tokenizer.
Figure 5: The distribution of the number of citations per sentence and summary in WebCiteS.
...and 2 more figures
