Automatic Summarization of Long Documents

Naman Chhibbar; Jugal Kalita

TL;DR

This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications.

Abstract

A vast amount of textual data is added to the internet daily, making utilization and interpretation of such data difficult and cumbersome. As a result, automatic text summarization is crucial for extracting relevant information, saving precious reading time. Although many transformer-based models excel in summarization, they are constrained by their input size, preventing them from processing texts longer than their context size. This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications. We test our algorithms on texts with more than 70,000 words, and our experiments show a significant increase in BERTScore with competitive ROUGE scores.

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, and 4 tables.

Introduction
Problem Statement
Related Works
Datasets
Methodology
Central Truncation
Document Skimming
Summarization with Keyword Extraction
Evaluation Metrics
Experimental Findings
Future Work
Conclusion

Figures (6)

Figure 1: GovReport word counts. Document word counts are on the x-axis with the number of documents on the y-axis.
Figure 2: BigPatent word counts. Document word counts are on the x-axis with the number of documents on the y-axis.
Figure 3: The Document Skimming Algorithm. The grey blocks represent segments of the document.
Figure 4: Segments picked by the Document Skimming algorithm. The y-axis value of the i-th segment on the x-axis is 1 if it is picked, 0 otherwise.
Figure 5: Segments picked by the Summarization with Keyword Extraction algorithm. The y-axis value of the i-th segment on the x-axis is 1 if it is picked, 0 otherwise.
...and 1 more figure
