On Context Utilization in Summarization with Large Language Models

Mathieu Ravaut, Aixin Sun, Nancy F. Chen, Shafiq Joty

TL;DR

This work reveals a pronounced middle-curse in abstractive summarization with large language models: models disproportionately draw on information at the start or end of the source and underuse content in the middle. Through a large-scale study of 6 LLMs, 10 datasets, and 5 metrics, it introduces MiddleSum to quantify middle-salience and evaluates two inference methods (hierarchical and incremental summarization) to mitigate the bias, with mixed, domain-dependent results. The study further shows that extending context length beyond 4k tokens does not consistently improve performance under current evaluation frameworks, and that decoding strategies do not eliminate the bias. Overall, the paper calls for saliency-aware evaluation and for prompts that make better use of full-context information to enable reliable long-context summarization.

Abstract

Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens. However, in question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped performance pattern with respect to where the answer is located within the input. This bias raises concerns, particularly in summarization, where crucial content may be dispersed throughout the source document(s). Moreover, in summarization, mapping facts from the source to the summary is not trivial, as salient content is usually rephrased. In this paper, we conduct the first comprehensive study on context utilization and position bias in summarization. Our analysis encompasses 6 LLMs, 10 datasets, and 5 evaluation metrics. We introduce a new evaluation benchmark called MiddleSum, on which we benchmark two alternative inference methods to alleviate position bias: hierarchical summarization and incremental summarization. Our code and data can be found here: https://github.com/ntunlp/MiddleSum.

Paper Structure (25 sections, 2 equations, 10 figures, 8 tables)

This paper contains 25 sections, 2 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Distribution of the relative location of summary bigrams within the source. We split each source document into 20 bins of the same number of words, and plot the distribution of summary bigrams over source bins.
  • Figure 2: Distribution of relative location of input context sentences aligned with sentences from summaries. X-axis corresponds to the source sentence bin, y-axis to the fraction of aligned sentences in each bin.
  • Figure 3: Multi-document summarization performance on Multi-XScience (top row) and Multi-News (bottom row) when a unique relevant document is used, and its position is varied (x-axis). Dashed horizontal lines correspond to the random baseline.
  • Figure 4: Fine-grained evaluation of multi-document summarization on Multi-News with GPT-3.5 when varying the position of a unique relevant input document.
  • Figure 5: Reference-based evaluation on the MiddleSum dataset. We also report (gray bars) performance achieved by uniformly sampling subsets of the same size as MiddleSum from the original datasets, alongside bootstrapping variance (black lines).
  • ...and 5 more figures
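
The binning described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: whitespace tokenization, lowercasing, matching each summary bigram to its first occurrence in the source, and the function name `bigram_position_histogram` are all hypothetical choices made for this example.

```python
from collections import Counter


def bigram_position_histogram(source, summary, num_bins=20):
    """Return, per source bin, the fraction of matched summary bigrams.

    Assumptions (not from the paper): whitespace tokenization, lowercasing,
    and first-occurrence matching of each summary bigram in the source.
    """
    src = source.lower().split()
    summ = summary.lower().split()
    if len(src) < 2:
        return [0.0] * num_bins

    # Record the first position at which each source bigram occurs.
    first_pos = {}
    for i in range(len(src) - 1):
        first_pos.setdefault((src[i], src[i + 1]), i)

    counts = Counter()
    for j in range(len(summ) - 1):
        pos = first_pos.get((summ[j], summ[j + 1]))
        if pos is not None:
            # Map the word offset to one of num_bins roughly equal-sized bins.
            bin_idx = min(pos * num_bins // len(src), num_bins - 1)
            counts[bin_idx] += 1

    total = sum(counts.values())
    if total == 0:
        return [0.0] * num_bins
    return [counts[b] / total for b in range(num_bins)]
```

With a source that has salient content only at its two ends, the histogram mass concentrates in the first and last bins, which is the kind of U-shape the figure is meant to reveal. Summary bigrams that do not appear in the source are ignored here, so heavily rephrased summaries would yield few matches.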