Revisiting Zero-Shot Abstractive Summarization in the Era of Large Language Models from the Perspective of Position Bias

Anshuman Chhabra; Hadi Askari; Prasant Mohapatra

TL;DR

The paper studies zero-shot abstractive summarization with Large Language Models (LLMs) through the lens of position bias, which it introduces as a generalization of lead bias. It proposes a measurement framework that maps summary content back to article sentences, segments each article into $K$ parts, and compares the gold and model-derived positional distributions using the Wasserstein distance $W$ to quantify bias. Empirically, LLMs generally achieve high ROUGE scores with low position bias on most datasets, though XSum exhibits stronger lead bias; encoder-decoder baselines show higher bias in zero-shot settings. The work further shows that the match between finetuning and evaluation datasets, as well as prompt engineering, can influence bias, and it provides open-source code to support reproducibility and further study.
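The measurement idea summarized above can be illustrated with a minimal sketch. It is not the authors' released code and rests on several assumptions: the mapping from summary content to article sentence indices is taken as given, the helper names (`segment_index`, `positional_distribution`, `wasserstein_1d`) are hypothetical, segments are treated as points spaced one unit apart, and the paper's exact segmentation and aggregation across articles may differ.

```python
from collections import Counter
from typing import List, Sequence


def segment_index(sentence_idx: int, n_sentences: int, k: int) -> int:
    """Map a 0-based sentence position to one of k roughly equal-length segments."""
    return min(k - 1, sentence_idx * k // n_sentences)


def positional_distribution(
    sentence_indices: Sequence[int], n_sentences: int, k: int
) -> List[float]:
    """Fraction of mapped summary sentences that fall in each of the k segments."""
    counts = Counter(segment_index(i, n_sentences, k) for i in sentence_indices)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("at least one mapped sentence index is required")
    return [counts.get(s, 0) / total for s in range(k)]


def wasserstein_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Wasserstein-1 distance between two distributions over segments 0..k-1.

    For distributions on the same unit-spaced support, this equals the sum of
    absolute differences between their cumulative distributions.
    """
    cdf_gap = 0.0
    distance = 0.0
    for p_i, q_i in zip(p, q):
        cdf_gap += p_i - q_i
        distance += abs(cdf_gap)
    return distance


# Toy example (made-up indices): a 9-sentence article split into K = 3 segments.
gold = positional_distribution([6, 7, 8], n_sentences=9, k=3)   # tail-heavy
model = positional_distribution([0, 1, 2], n_sentences=9, k=3)  # lead-heavy
bias = wasserstein_1d(gold, model)
```

In this toy case the gold summary draws only on the last segment and the model summary only on the first, so the distance takes its largest possible value for $K = 3$ under the unit-spacing assumption; identical distributions give a distance of zero.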

Abstract

We characterize and study zero-shot abstractive summarization in Large Language Models (LLMs) by measuring position bias, which we propose as a general formulation of the more restrictive lead bias phenomenon studied previously in the literature. Position bias captures the tendency of a model to unfairly prioritize information from certain parts of the input text over others, leading to undesirable behavior. Through numerous experiments on four diverse real-world datasets, we study position bias in multiple LLMs such as GPT 3.5-Turbo, Llama-2, and Dolly-v2, as well as state-of-the-art pretrained encoder-decoder abstractive summarization models such as Pegasus and BART. Our findings lead to novel insights and discussion on the performance and position bias of models for zero-shot summarization tasks.

Paper Structure (24 sections, 9 figures, 1 table)

Introduction
Related Works
Proposed Approach
Zero-Shot Abstractive Summarization
Formulating and Estimating Position Bias
Results
Discussion
Conclusion
Dividing Articles into $K$ Segments of (Approximately) Equal Length
Additional Results for Other ROUGE Metrics
Additional Position Bias Results for Finetuning BART and Pegasus
Additional Results for Different $\phi$
Additional Results for Measuring Correlation Between ROUGE and Position Bias
Dataset, Model, and Training Details
Detailed Dataset Information
...and 9 more sections

Figures (9)

Figure 1: An example of position bias where the gold summary is tail-biased and the model summary is lead-biased.
Figure 2: Visualizing positional distributions of gold and model-generated summaries for all datasets. The more "different" these distributions are for a given dataset/model, the more position-biased the model is for that dataset.
Figure 3: Measuring performance ($R^1$ score) and position bias (Wasserstein distance between gold and generated summaries' positional distributions). Lower Wasserstein distance values correspond to lower position bias.
Figure 4: Additional results for $R^2$ and $R^L$ metrics.
Figure 5: Visualizing positional distributions of gold and Pegasus/BART-generated summaries for all datasets with and without finetuning on a particular dataset (training set). For the finetuned models, the diagonal subfigures are the ones that have the same finetuning and evaluation datasets and have low position bias. All other subfigures have a mismatch between finetuning and evaluation datasets and exhibit high levels of position bias. That is, the model-generated summary positional distribution is very different from the gold summary positional distribution. The no-finetuning results were also shown in Figure 2 and are provided again for reference.
...and 4 more figures
