Where did you get that? Towards Summarization Attribution for Analysts
Violet B, John M. Conroy, Sean Lynch, Danielle M, Neil P. Molino, Aaron Wiechmann, Julia S. Yang
TL;DR
This work tackles attribution in analyst-focused automatic summaries by linking each summary sentence to supporting source passages. It evaluates a hybrid summarization pipeline (OCCAMS extractive summarization plus a GPT paraphrase) against a purely abstractive GPT approach on the CrisisFACTS, Cyber Threat Intelligence, and TAC 2011 datasets. It systematically compares two attribution methods, NLI and sentence embeddings, using human judgments and task-based evaluation (Task 1 and Task 2). Embedding-based attribution generally aligns better with human judgments, and the hybrid pipeline often makes attribution easier, although refutation patterns vary by dataset. The study introduces a refutation typology to categorize factual errors and shows that parsing, time-shift, and related information issues influence attribution quality, with practical implications for trustworthy analyst-ready summaries. Overall, the results highlight the value of a hybrid extraction-plus-paraphrase approach and targeted attribution strategies for reducing hallucinations and improving the traceability of automated summaries.
Abstract
Analysts require attribution, as nothing can be reported without knowing the source of the information. In this paper, we focus on automatic methods for attribution, linking each sentence in the summary to a portion of the source text, which may be in one or more documents. We explore using hybrid summarization, i.e., an automatic paraphrase of an extractive summary, to ease attribution. We also use a custom typology to identify the proportion of different categories of attribution-related errors.
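
To make the attribution task concrete, the following is a minimal sketch of linking each summary sentence to its most similar source sentence. It is not the method evaluated in this paper: as a stand-in for sentence embeddings it uses cosine similarity over raw token counts, so that the example runs with the Python standard library alone. The function names (`tokenize`, `cosine`, `attribute`) and the toy sentences are hypothetical and chosen only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(summary_sentences, source_sentences):
    """Return, for each summary sentence, (best source index, similarity).

    Assumption: token-count vectors stand in for real sentence embeddings,
    and source_sentences is non-empty.
    """
    source_vecs = [Counter(tokenize(s)) for s in source_sentences]
    links = []
    for sent in summary_sentences:
        vec = Counter(tokenize(sent))
        scores = [cosine(vec, sv) for sv in source_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        links.append((best, scores[best]))
    return links


# Toy data, invented for this example only.
source = [
    "The storm made landfall on Tuesday near the coast.",
    "Officials ordered the evacuation of three coastal towns.",
    "Power was restored to most homes by Friday.",
]
summary = [
    "Three coastal towns were evacuated by officials.",
    "Most homes had power restored by Friday.",
]

links = attribute(summary, source)
for sent, (idx, score) in zip(summary, links):
    print(f"{sent!r} -> source[{idx}] (similarity {score:.2f})")
```

In a realistic setting, the token-count vectors would be replaced by vectors from a sentence-embedding model, and a similarity threshold could flag summary sentences that have no adequately supporting source passage.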
