
Are Finer Citations Always Better? Rethinking Granularity for Attributed Generation

Hexuan Wang, Jingyu Zhang, Benjamin Van Durme, Daniel Khashabi

Abstract

Citation granularity (whether to cite individual sentences, paragraphs, or whole documents) is a critical design choice in attributed generation. While fine-grained citations are often preferred because they ease precise human verification, their impact on model performance remains under-explored. We analyze four model scales (8B-120B) and demonstrate that enforcing fine-grained citations degrades attribution quality by 16-276% relative to the best-performing granularity. We observe a consistent pattern in which attribution quality peaks at intermediate (paragraph-level) granularity. Our analysis suggests that fine-grained (sentence-level) citations sever the semantic dependencies needed to attribute evidence to answer claims, while excessively coarse (multi-paragraph) citations introduce distracting noise. Importantly, the magnitude of this performance gap varies non-monotonically with model scale: fine-grained constraints disproportionately penalize larger models, suggesting that atomic citation units disrupt the multi-sentence information synthesis at which these models excel. Strikingly, citation-optimal granularity yields substantial gains in attribution quality while preserving, or even improving, answer correctness. Overall, our findings show that optimizing solely for human verifiability via fine-grained citations disregards model constraints, compromising both attribution faithfulness and generation reliability. Effective attribution instead requires aligning citation granularity with the model's natural semantic scope.
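
To make this design choice concrete, the sketch below groups $k$ consecutive source sentences into one citable unit, so $k{=}1$ corresponds to fine-grained (sentence-level) citations and larger $k$ approaches paragraph-level granularity. The helper and its data are hypothetical illustrations; the paper's actual chunking pipeline may differ (e.g., by respecting paragraph boundaries).

```python
# Hypothetical chunker illustrating the granularity knob: merge every k
# consecutive sentences into a single citable unit. k=1 gives sentence-level
# (fine) chunks; larger k approximates paragraph-level units. This is an
# illustration, not the paper's actual preprocessing pipeline.

def chunk_sentences(sentences: list[str], k: int) -> list[str]:
    return [" ".join(sentences[i:i + k]) for i in range(0, len(sentences), k)]

# Toy example echoing Figure 1: the second sentence depends on the first
# for its subject, so sentence-level chunking breaks the dependency chain.
sents = [
    "Alberic III held the title.",
    "He later passed it to a successor.",
]
print(chunk_sentences(sents, 1))  # two isolated units; 'He' loses its referent
print(chunk_sentences(sents, 2))  # one coarser unit preserving the dependency
```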

Paper Structure

This paper contains 68 sections, 11 figures, and 15 tables.

Figures (11)

  • Figure 1: Impact of Citation Granularity. The model processes a long context with many chunks and generates an answer with supporting citations. We show only the chunks relevant to the query. At fine granularity (Left), sentence-level chunks (C6-C8) cannot independently support the statement. Although the generated answer is correct, attribution fails because fine-grained chunking isolates the subject ("Alberic III") from the relationship. The model selects the citation chunk containing the subject anchor (C6), but this evidence is incomplete without the dependency chain. In contrast, coarse granularity (Right) merges these sentences into paragraph-level chunk C3, preserving the full dependency required for a supportive citation.
  • Figure 2: The Granularity-Performance Curves. We track Citation F1 across granularity settings for three representative citation volume ranges. Across all four evaluated model scales, performance is consistently lowest at fine granularity ($k{=}1,2$) and peaks at intermediate settings before declining or plateauing at coarse settings.
  • Figure 3: Citation F1 Decomposition (GPT-120B, $\boldsymbol{16 \le V \le 31}$). F1 (red) peaks by balancing opposing trends: Precision (green) improves with granularity (better context), while Recall (blue) degrades at coarse settings ($k=16$) due to noise. A set-based sketch of this metric follows the figure list.
  • Figure 4: Asymmetric Sensitivity (Llama-70B). Lines depict relative change from the fine-grained baseline ($k{=}1$). Attribution quality (Citation F1, red) responds strongly to granularity ($>25\%$ gain), whereas answer correctness (blue) is unresponsive ($<1\%$ change).
  • Figure 5: Citation Volume Distributions. Histograms show statement counts per volume range for each granularity setting. Missing bars reflect the structural constraints described in the methodology appendix. Comparisons in the main text are made strictly within vertical slices (fixed volume ranges).
  • ...and 6 more figures
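
To make the Citation F1 metric tracked in Figures 2-4 concrete, here is a minimal, set-based sketch. It assumes that for each generated statement we know both the chunks it cites and the chunks that genuinely support it; the paper's evaluator may instead judge support with an entailment model, so treat this as an illustration of the precision/recall trade-off rather than the exact protocol.

```python
# Set-based sketch of per-statement Citation F1 (assumption: supporting chunk
# ids are known; a real evaluator may use NLI-based support judgments instead).

def citation_f1(cited: set[str], supporting: set[str]) -> float:
    """Harmonic mean of citation precision and recall for one statement."""
    if not cited or not supporting:
        return 0.0
    overlap = len(cited & supporting)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cited)    # fraction of citations that are supportive
    recall = overlap / len(supporting)  # fraction of needed evidence that is cited
    return 2 * precision * recall / (precision + recall)

# Figure 1's failure case in miniature: at fine granularity the model cites
# only the subject-anchor chunk C6, missing C7 and C8.
print(citation_f1({"C6"}, {"C6", "C7", "C8"}))  # P=1.00, R=0.33 -> F1=0.50
# At paragraph granularity the merged chunk C3 carries the full dependency.
print(citation_f1({"C3"}, {"C3"}))              # P=R=1.00 -> F1=1.00
```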