Unstructured Evidence Attribution for Long Context Query Focused Summarization

Dustin Wright, Zain Muhammad Mujahid, Lu Wang, Isabelle Augenstein, David Jurgens

TL;DR

This work introduces unstructured evidence attribution for long-context query-focused summarization (LCQFS) and a synthetic training corpus, SUnsET, generated via a six-stage inductive pipeline. By training adapters and using chunked divide-and-conquer summarization, models learn to copy and cite arbitrary-length evidence from long contexts, reducing the lost-in-the-middle bias. Across five models and four diverse datasets, SUnsET-adapted systems show higher evidence copy accuracy, improved citation quality, and better overall summary quality than fixed-granularity baselines, approaching or matching reference baselines in many cases. The study also highlights the cost effectiveness of synthetic data, the need to manage hallucination risk, and directions for future domain-aware and RAG-enabled improvements.

Abstract

Large language models (LLMs) are capable of generating coherent summaries from very long contexts given a user query, and extracting and citing evidence spans helps improve the trustworthiness of these summaries. Whereas previous work has focused on evidence citation with fixed levels of granularity (e.g., sentence, paragraph, or document), we propose to extract unstructured (i.e., spans of any length) evidence in order to acquire more relevant and consistent evidence than in the fixed-granularity case. We show how existing systems struggle to copy and properly cite unstructured evidence, which also tends to be "lost-in-the-middle". To help models perform this task, we create the Summaries with Unstructured Evidence Text dataset (SUnsET), a synthetic dataset generated using a novel pipeline, which can be used as training supervision for unstructured evidence summarization. We demonstrate across 5 LLMs and 4 datasets spanning human-written, synthetic, single-document, and multi-document settings that LLMs adapted with SUnsET generate more relevant and factually consistent evidence with their summaries, extract evidence from more diverse locations in their context, and can generate more relevant and consistent summaries than baselines with no fine-tuning and fixed-granularity evidence. We release SUnsET and our generation code to the public.
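
To make the two failure modes named in the abstract concrete (evidence that is not actually copied from the context, and evidence that clusters away from the middle of the context), the following is a minimal sketch of how one might check them. It is an illustration only, not the paper's evaluation code: the names `check_evidence` and `copy_accuracy`, the whitespace and case normalization, and the exact-substring criterion are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvidenceCheck:
    copied: bool                         # span found verbatim (after normalization)
    relative_position: Optional[float]   # 0.0 = start of context, 1.0 = end


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase (an assumed, lenient matching rule)."""
    return " ".join(text.split()).lower()


def check_evidence(context: str, span: str) -> EvidenceCheck:
    """Test whether a cited span is copied from the context and where it sits."""
    ctx, ev = normalize(context), normalize(span)
    if not ev:
        return EvidenceCheck(False, None)
    start = ctx.find(ev)
    if start < 0:
        # The span does not occur in the context: treat it as not copied.
        return EvidenceCheck(False, None)
    # Scale the start offset so a span at the very end maps to 1.0.
    return EvidenceCheck(True, start / max(len(ctx) - len(ev), 1))


def copy_accuracy(context: str, spans: List[str]) -> float:
    """Fraction of cited spans that occur verbatim in the context."""
    if not spans:
        return 0.0
    return sum(check_evidence(context, s).copied for s in spans) / len(spans)


if __name__ == "__main__":
    context = "Alpha beta. Gamma delta epsilon. Zeta eta."
    spans = ["Gamma delta", "zeta   eta.", "not in the context"]
    print(copy_accuracy(context, spans))
    print([check_evidence(context, s).relative_position for s in spans])
```

Collecting `relative_position` over many summaries gives a histogram of where evidence is drawn from; a dip around 0.5 would be one way to visualize a lost-in-the-middle pattern. Exact matching is deliberately strict here, and a fuzzy matcher could be substituted if near-copies should count.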

Paper Structure

This paper contains 36 sections, 24 figures, 8 tables.

Figures (24)

  • Figure 1: Summarization with unstructured evidence requires a model to retrieve spans of any arbitrary length from the context to support individual sentences in the summary. Example given from Llama 3.1 8B trained on our dataset (SUnsET).
  • Figure 2: Examples of fixed-granularity and unstructured evidence generated by models in our study. Fixed-granularity citations may include irrelevant information or too little information to support their citing sentences. Unstructured evidence allows for more flexible and precise evidence.
  • Figure 3: Six-stage inductive data generation pipeline. The full prompts for each stage are given in Appendix \ref{sec:prompts}, \ref{fig:synth_title_prompt} to \ref{fig:validation_prompt}.
  • Figure 4: Snippets from a SUnsET document.
  • Figure 5: Average relevance and consistency of evidence texts with respect to their citation sentences, measured using an autorater \cite{DBLP:conf/emnlp/LiuIXWXZ23} based on prompts that have previously undergone human evaluation for quality \cite{DBLP:journals/corr/abs-2410-23463}. Bold indicates best performance for a given model; "*" and "+" indicate statistical significance above the fixed-granularity and non-fine-tuned unstructured baselines, respectively, based on non-overlapping 95% confidence intervals.
  • ...and 19 more figures