On Context Utilization in Summarization with Large Language Models
Mathieu Ravaut, Aixin Sun, Nancy F. Chen, Shafiq Joty
TL;DR
This work reveals a pronounced "middle curse" in abstractive summarization with large language models: models disproportionately draw on information at the start or end of the source and underutilize content in the middle. Through a large-scale study of 6 LLMs, 10 datasets, and 5 metrics, it introduces MiddleSum, a benchmark targeting salient content in the middle of the source, and evaluates two inference methods (hierarchical and incremental summarization) to mitigate the bias, with mixed, domain-dependent results. The study further shows that extending context length beyond 4k tokens does not consistently improve performance under current evaluation frameworks, and that decoding strategies do not eliminate the bias. Overall, the paper calls for saliency-aware evaluation and for prompts that make better use of the full context to enable reliable long-context summarization.
Abstract
Large language models (LLMs) excel in abstractive summarization tasks, delivering fluent and pertinent summaries. Recent advancements have extended their capabilities to handle long-input contexts, exceeding 100k tokens. However, in question answering, language models exhibit uneven utilization of their input context. They tend to favor the initial and final segments, resulting in a U-shaped performance pattern with respect to where the answer is located within the input. This bias raises concerns, particularly in summarization, where crucial content may be dispersed throughout the source document(s). Moreover, in summarization, mapping facts from the source to the summary is not trivial, as salient content is usually rephrased. In this paper, we conduct the first comprehensive study on context utilization and position bias in summarization. Our analysis encompasses 6 LLMs, 10 datasets, and 5 evaluation metrics. We introduce a new evaluation benchmark called MiddleSum, on which we benchmark two alternative inference methods to alleviate position bias: hierarchical summarization and incremental summarization. Our code and data can be found here: https://github.com/ntunlp/MiddleSum.
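
The abstract names hierarchical and incremental summarization without describing them. As a rough illustration only, the sketch below follows the common reading of these terms: hierarchical summarization summarizes each chunk independently and then summarizes the concatenated partial summaries, while incremental summarization walks through the chunks in order and refreshes a running summary. This is not the authors' implementation. The function names, the word-based chunking, and the `summarize` callable (a stand-in for an LLM call) are all assumptions made for this example.

```python
from typing import Callable, List

# Hypothetical stand-in for an LLM call: takes text, returns a summary.
Summarizer = Callable[[str], str]


def split_into_chunks(text: str, chunk_size: int) -> List[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Word-based chunking is an assumption; a real system would likely
    count tokens with the model's tokenizer instead.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


def hierarchical_summarize(text: str, summarize: Summarizer, chunk_size: int) -> str:
    """Summarize each chunk independently, then summarize the summaries."""
    chunks = split_into_chunks(text, chunk_size)
    if len(chunks) <= 1:
        # Short input: a single direct call is enough.
        return summarize(text)
    partial_summaries = [summarize(chunk) for chunk in chunks]
    return summarize("\n".join(partial_summaries))


def incremental_summarize(text: str, summarize: Summarizer, chunk_size: int) -> str:
    """Process chunks in order, updating one running summary at each step."""
    summary = ""
    for chunk in split_into_chunks(text, chunk_size):
        # After the first chunk, the model sees the running summary
        # followed by the next chunk.
        model_input = chunk if not summary else f"{summary}\n{chunk}"
        summary = summarize(model_input)
    return summary
```

Under these assumptions, every chunk is processed within a short input in both schemes, so middle content is never buried deep inside one long context. The cost is extra model calls: one per chunk plus a final merge for the hierarchical variant, and one per chunk for the incremental variant.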
