Improving Faithfulness of Large Language Models in Summarization via Sliding Generation and Self-Consistency

Taiji Li; Zhi Li; Yin Zhang

TL;DR

This work tackles the hallucination problem in LLM-based abstractive summarization, especially under long-context settings. It introduces SliSum, a sliding-generation framework that processes overlapping windows of the source article, then filters and aggregates local summaries through lexical clustering and self-consistency-driven majority voting to form a faithful global summary. The method, applicable to diverse LLMs (e.g., LLaMA-2, Claude-2, GPT-3.5) and across short and long documents, improves factual consistency metrics (FactCC, SummaC) while preserving fluency and informativeness, without requiring fine-tuning or external data. The authors provide extensive ablations and hyperparameter analyses and show a favorable complexity profile (near $O(L)$) with practical runtime costs, supporting broad applicability for reliable automatic summarization in real-world settings.

Abstract

Although large language models (LLMs) have demonstrated impressive performance in various tasks, they still suffer from the factual inconsistency problem known as hallucination. For instance, LLMs occasionally generate content that diverges from the source article, and they tend to extract information that appears at the beginning and end of the context, especially in long-document summarization. Inspired by these findings, we propose to improve the faithfulness of LLMs in summarization by impelling them to process the entire article more fairly and faithfully. We present a novel summary generation strategy, namely SliSum, which exploits the ideas of sliding windows and self-consistency. Specifically, SliSum divides the source article into overlapping windows and uses an LLM to generate local summaries for the content in each window. Finally, SliSum aggregates all local summaries using clustering and a majority voting algorithm to produce a more faithful summary of the entire article. Extensive experiments demonstrate that SliSum significantly improves the faithfulness of diverse LLMs, including LLaMA-2, Claude-2 and GPT-3.5, in both short and long text summarization, while maintaining their fluency and informativeness and without additional fine-tuning or resources. We further conduct qualitative and quantitative studies to investigate why SliSum works and how its hyperparameters affect performance.
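The overlapping-window split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `sliding_windows`, the choice to measure window and step in list items (for example, sentences), and the handling of the final window are all assumptions made for illustration.

```python
from typing import List, Sequence


def sliding_windows(units: Sequence[str], window: int, step: int) -> List[List[str]]:
    """Split `units` (e.g. sentences) into overlapping windows.

    Hypothetical helper: `window` and `step` are counted in list items here,
    which is an assumption of this sketch.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    if not units:
        return []

    windows: List[List[str]] = []
    start = 0
    while True:
        windows.append(list(units[start:start + window]))
        # Stop once the current window reaches the end of the article.
        if start + window >= len(units):
            break
        start += step
    return windows


sentences = [f"s{i}" for i in range(10)]
local_inputs = sliding_windows(sentences, window=4, step=2)
```

With a step smaller than the window, most sentences fall into more than one window, so the same content is summarized several times. That repetition is what a later voting step can exploit.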

Paper Structure (37 sections, 8 equations, 4 figures, 7 tables)

Introduction
Related Works
Factual Consistency of Summarization
Mitigation of LLM Hallucination
Long Context for LLMs
Approach
Sliding Generation
Sliding Window
Process Repeatedly
Events Filtering
Lexical Clustering
Filtering Noise
Contradictions Detection and Sentences Aggregation
Sentences Selection
Sentences Integration
...and 22 more sections

Figures (4)

Figure 1: The pipeline and an example of our proposed SliSum approach. To solve the self-contradiction problem, SliSum takes a majority vote over the sentences of each cluster based on their semantics and selects the category with the most votes. For instance, the green sentences have similar semantics and appear twice, while the red sentence, which has different semantics, appears only once. Hence, the second green sentence is selected for output to the final summary. In the implementation, SliSum processes the source article at the sentence level. For simplicity of illustration, the windows in the figure are represented by text lines.
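The voting step in the Figure 1 caption can be sketched as follows. This is a hedged illustration only: the source does not say how semantic equivalence is judged or how one sentence is picked from the winning category. The sketch stands in a word-overlap (Jaccard) test for the semantic comparison and returns the first member of the largest group. The names `jaccard`, `majority_vote`, and the 0.7 threshold are all hypothetical.

```python
from typing import Callable, List


def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity; a crude stand-in for a semantic comparison."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def majority_vote(cluster: List[str], same_meaning: Callable[[str, str], bool]) -> str:
    """Group the sentences of one cluster by meaning and return a sentence
    from the largest group (assumption: the first member is returned)."""
    if not cluster:
        raise ValueError("cluster must not be empty")

    groups: List[List[str]] = []
    for sentence in cluster:
        for group in groups:
            # Compare against the first sentence of each existing group.
            if same_meaning(sentence, group[0]):
                group.append(sentence)
                break
        else:
            groups.append([sentence])

    winner = max(groups, key=len)  # ties resolve to the earliest group
    return winner[0]


cluster = [
    "The fire started on Monday night",
    "The fire started Monday night",
    "The fire started on Tuesday morning",
]
chosen = majority_vote(cluster, lambda a, b: jaccard(a, b) >= 0.7)
```

Here the two "Monday" sentences form the majority group, so the contradicting "Tuesday" sentence is dropped.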
Figure 2: The performance of GPT-3.5 evaluated on samples of different length.
Figure 3: The factual consistency of sliding generation with aggregation and without aggregation.
Figure 4: Impact of the ratio $L_w / L_s$ (left) and the window size (right) on the faithfulness of GPT-3.5. To analyze the ratio, we fix $L_w = 900$ and gradually decrease $L_s$ to adjust the ratio. To analyze the window size, we fix $L_w / L_s = 5$ and increase $L_w$ by 150 words each time, from 150 to 1200 words.
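To make the Figure 4 settings concrete, the short sketch below works through the arithmetic, reading $L_w$ as the window size and $L_s$ as the step size in words (an interpretation of the caption, not a quoted definition). The helper `num_windows` and the 3000-word article length are hypothetical and used only for illustration.

```python
import math


def num_windows(n_words: int, window: int, step: int) -> int:
    """Number of windows needed to cover n_words (hypothetical helper)."""
    if n_words <= window:
        return 1
    return math.ceil((n_words - window) / step) + 1


# Right panel: window sizes from 150 to 1200 in steps of 150, ratio fixed at 5.
window_sizes = list(range(150, 1201, 150))
step_sizes = [w // 5 for w in window_sizes]

# Left panel, one setting: L_w = 900 with ratio 5 gives L_s = 180.
windows_for_3000_words = num_windows(3000, window=900, step=180)
# Away from the article boundaries, each word falls inside L_w / L_s windows.
coverage = 900 // 180
```

Under this reading, the ratio $L_w / L_s$ controls how many times each interior passage is summarized, while the number of windows grows linearly with article length for a fixed step.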
