Neural Natural Language Processing for Long Texts: A Survey on Classification and Summarization

Dimitrios Tsirmpas; Ioannis Gkionis; Georgios Th. Papadopoulos; Ioannis Mademlis

TL;DR

This article presents an introductory exploration of document-level analysis, addressing the primary challenges, concerns, and existing solutions, and presents an original overarching taxonomy of common deep neural methods for long document analysis.

Abstract

The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural Language Processing (NLP) during the past decade. However, the demands of long document analysis are quite different from those of shorter texts, while the ever-increasing size of documents uploaded online renders automated understanding of lengthy texts a critical issue. Relevant applications include automated Web mining, legal document review, medical records analysis, financial reports analysis, contract management, environmental impact assessment, and news aggregation. Although efficient algorithms for analyzing long documents have been developed only relatively recently, practical tools in this field are currently flourishing. This article serves as an entry point into this dynamic domain and aims to achieve two objectives. First, it provides an introductory overview of the relevant neural building blocks, serving as a concise tutorial for the field. Second, it offers a brief examination of the current state of the art in two key long document analysis tasks: document classification and document summarization. Sentiment analysis for long texts is also covered, since it is typically treated as a particular case of document classification. Consequently, this article presents an introductory exploration of document-level analysis, addressing the primary challenges, concerns, and existing solutions. Finally, it offers a concise definition of "long text/document", presents an original overarching taxonomy of common deep neural methods for long document analysis, and lists publicly available annotated datasets that can facilitate further research in this area.

Paper Structure

This paper contains 58 sections, 9 equations, 11 figures, and 5 tables.

Introduction
Deep Neural Networks Used in Document Analysis
General Neural Architectures
Encoder-Decoder Architectures and the Attention Mechanism
Transformers
Neural Building Blocks for Document Analysis
Word Embeddings
BERT
RoBERTa and ELECTRA
T5
BART
GPT and Large Language Models
Long Document Analysis
Definition
Shared Challenges
...and 43 more sections

Figures (11)

Figure 1: The original Transformer Encoder-Decoder architecture.
Figure 2: BERT input representation. The input embeddings are the sum of the token embeddings, the segmentation embeddings and the position embeddings.
Figure 3: Graphical comparison between the MLM and the RTD tasks.
Figure 4: The T5 pretraining objective.
Figure 5: A proposed taxonomy of DNNs designed for long document analysis, focusing on methods that can be applied for classification or summarization. The methods are presented as the tree leaves, while their color denotes the NLP task they were designed for.
...and 6 more figures
