Automatic Summarization of Long Documents

Naman Chhibbar, Jugal Kalita

TL;DR

This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications.

Abstract

A vast amount of textual data is added to the internet daily, making utilization and interpretation of such data difficult and cumbersome. As a result, automatic text summarization is crucial for extracting relevant information, saving precious reading time. Although many transformer-based models excel in summarization, they are constrained by their input size, preventing them from processing texts longer than their context size. This study introduces three novel algorithms that allow any LLM to efficiently overcome its input size limitation, effectively utilizing its full potential without any architectural modifications. We test our algorithms on texts with more than 70,000 words, and our experiments show a significant increase in BERTScore with competitive ROUGE scores.

Paper Structure

This paper contains 12 sections, 7 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: GovReport word counts. Document word counts are on the x-axis with the number of documents on the y-axis.
  • Figure 2: BigPatent word counts. Document word counts are on the x-axis with the number of documents on the y-axis.
  • Figure 3: The Document Skimming Algorithm. The grey blocks represent segments of the document.
  • Figure 4: Segments picked by the Document Skimming algorithm. The y-axis value for the i-th segment on the x-axis is 1 if it is picked and 0 otherwise.
  • Figure 5: Segments picked by the Summarization with Keyword Extraction algorithm. The y-axis value for the i-th segment on the x-axis is 1 if it is picked and 0 otherwise.
  • ...and 1 more figure
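
The 0/1 encoding described in the captions of Figures 4 and 5 can be illustrated with a minimal sketch. This is not the authors' code: the function name `pick_indicator`, the segment count, and the picked indices are all hypothetical, chosen only to show how such an indicator vector could be built from a list of picked segment indices.

```python
def pick_indicator(num_segments, picked):
    """Return a 0/1 list where entry i is 1 if segment i was picked, else 0.

    Hypothetical helper for illustration; indices outside
    range(num_segments) are simply ignored.
    """
    picked_set = set(picked)
    return [1 if i in picked_set else 0 for i in range(num_segments)]


# Assumed example: a document split into 8 segments, of which 3 are picked.
indicator = pick_indicator(8, [0, 3, 6])
print(indicator)  # [1, 0, 0, 1, 0, 0, 1, 0]
```

Plotting such a list against the segment index would give a figure of the kind the captions describe, with spikes at the picked segments.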