LOCOST: State-Space Models for Long Document Abstractive Summarization

Florian Le Bronnec, Song Duong, Mathieu Ravaut, Alexandre Allauzen, Nancy F. Chen, Vincent Guigue, Alberto Lumbreras, Laure Soulier, Patrick Gallinari

TL;DR

This work proposes LOCOST, an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. It handles input texts exceeding 600K tokens at inference time, sets new state-of-the-art results on full-book summarization, and opens new perspectives for long input processing.

Abstract

State-space models are a low-complexity alternative to transformers for encoding long sequences and capturing long-term dependencies. We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs. With a computational complexity of $O(L \log L)$ in the input length $L$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns. We evaluate our model on a series of long document abstractive summarization tasks. The model reaches 93-96% of the performance of the top-performing sparse transformers of the same size while saving up to 50% of memory during training and up to 87% during inference. Additionally, LOCOST effectively handles input texts exceeding 600K tokens at inference time, setting new state-of-the-art results on full-book summarization and opening new perspectives for long input processing.
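
The abstract does not say where the $O(L \log L)$ cost comes from. One common route for state-space models, assumed here and not confirmed by the text above, is to unroll the linear recurrence into a long convolution kernel and apply it with the FFT. The sketch below illustrates this for a scalar state in plain Python. The names `fft`, `causal_conv_fft`, `causal_conv_direct`, and `ssm_kernel` are hypothetical helpers for illustration, not part of LOCOST.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 Cooley-Tukey FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def causal_conv_fft(kernel, signal):
    """Causal convolution y[t] = sum_{s<=t} kernel[s] * signal[t-s] in O(L log L)."""
    length = len(signal)
    size = 1
    while size < 2 * length:  # zero-pad so circular convolution equals linear
        size *= 2
    k_hat = fft(list(kernel) + [0.0] * (size - len(kernel)))
    u_hat = fft(list(signal) + [0.0] * (size - length))
    product = [a * b for a, b in zip(k_hat, u_hat)]
    y = fft(product, inverse=True)
    # The unnormalized inverse transform needs a division by the FFT size.
    return [(v / size).real for v in y[:length]]


def causal_conv_direct(kernel, signal):
    """Reference O(L^2) implementation of the same convolution."""
    return [
        sum(kernel[s] * signal[t - s] for s in range(min(t, len(kernel) - 1) + 1))
        for t in range(len(signal))
    ]


def ssm_kernel(a, b, c, length):
    """Kernel K[s] = c * a**s * b of a scalar linear SSM (assumed toy form):
    x[t] = a * x[t-1] + b * u[t],  y[t] = c * x[t]."""
    return [c * (a ** s) * b for s in range(length)]
```

For example, `causal_conv_fft(ssm_kernel(0.9, 1.0, 0.5, 6), signal)` on a length-6 `signal` should agree, up to floating-point error, with both `causal_conv_direct` and a step-by-step run of the recurrence. The FFT path avoids the quadratic pairwise cost, which is the property the complexity claim relies on under this assumption.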

Paper Structure (63 sections, 8 equations, 7 figures, 9 tables)


Figures (7)

  • Figure 1: Mean ROUGE score against inference memory usage on long-document summarization with input length 16K (left: SummScreenFD dataset, right: GovReport dataset). The size of the circles represents the training memory usage. LOCOST demonstrates competitive performance compared to state-of-the-art sparse transformers of the same size, while being significantly more memory-efficient at both training and inference.
  • Figure 2: The embedded sequence is contextualized via a gated bidirectional SSM before passing through a gated feedforward net.
  • Figure 3: Visualization of the kernels corresponding to the first dimension for several layers of the pre-trained model. Bins show the average decay of the forward and backward kernels. This illustrates the different scales of each kernel. Layers 1 and 10 capture short and extra-short range contextualizations, while Layers 4 and 7 model extra-long and long contexts, respectively.
  • Figure 4: Memory consumption during a typical training iteration (forward + backward, left) and a typical inference iteration (forward only, right). Batch size = 1. A cross at the end of a curve means the model runs out of memory or hits an architectural limitation beyond that point.
  • Figure 5: LOCOST trained on increasing sequence lengths and evaluated on the BookSum-Book dataset without truncation, with texts reaching up to 600K tokens.
  • ...and 2 more figures