Extra Global Attention Designation Using Keyword Detection in Sparse Transformer Architectures

Evan Lucas, Dylan Kangas, Timothy C Havens

TL;DR

The paper addresses the challenge of encoding long-range context in sparse transformer architectures for abstractive summarization. It extends the Longformer Encoder-Decoder by prefixing the input with TF-IDF-selected keyword tokens and granting these tokens global attention in the encoder, while keeping sparse attention for the rest of the input. Results show dataset-dependent benefits, with notable gains in few-shot settings on arXiv and modest improvements on AMI, but limited or negative effects in large-scale full-set training and on multi-topic transcripts, highlighting sensitivity to keyword choice. The study demonstrates a practical method to inject steerable long-range context into sparse transformers and points to future work on more sophisticated keyword selection and controllable summarization strategies.

Abstract

In this paper, we propose an extension to Longformer Encoder-Decoder, a popular sparse transformer architecture. One common challenge with sparse transformers is that they can struggle to encode long-range context, such as connections between topics discussed at the beginning and end of a document. A method to selectively increase global attention is proposed and demonstrated for abstractive summarization tasks on several benchmark data sets. By prefixing the transcript with additional keywords and encoding global attention on these keywords, improvement in zero-shot, few-shot, and fine-tuned cases is demonstrated for some benchmark data sets.
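
The keyword-prefixing idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `tfidf_keywords` and `prefix_with_keywords`, the smoothed IDF formula, whitespace tokenization, and the tie-breaking rule are all assumptions made for illustration. It scores the tokens of one document with TF-IDF against a small corpus, prefixes the top-scoring tokens to the input, and builds a 0/1 mask marking only the prefixed keywords for global attention.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus_tokens, k=3):
    """Return the k tokens of doc_tokens with the highest TF-IDF score.

    Hypothetical helper: uses a smoothed IDF, log((1 + N) / (1 + df)).
    """
    n_docs = len(corpus_tokens)
    # Document frequency: in how many corpus documents each token appears.
    df = Counter()
    for tokens in corpus_tokens:
        df.update(set(tokens))
    tf = Counter(doc_tokens)
    scores = {
        tok: (count / len(doc_tokens)) * math.log((1 + n_docs) / (1 + df[tok]))
        for tok, count in tf.items()
    }
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def prefix_with_keywords(doc_tokens, keywords):
    """Prefix keywords to the document and mark them for global attention.

    Returns (tokens, global_mask), where global_mask[i] == 1 means token i
    would receive global attention and 0 means sparse (local) attention.
    """
    tokens = list(keywords) + list(doc_tokens)
    global_mask = [1] * len(keywords) + [0] * len(doc_tokens)
    return tokens, global_mask


corpus = [
    "the sparse attention model uses attention".split(),
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
keywords = tfidf_keywords(corpus[0], corpus, k=2)
tokens, global_mask = prefix_with_keywords(corpus[0], keywords)
print(keywords)
print(global_mask)
```

In a real pipeline the mask would be built over subword token IDs rather than whitespace tokens and passed to the encoder alongside the usual attention mask; how the keywords are tokenized and separated from the transcript is left open here.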

Paper Structure

This paper contains 8 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Visualization of self-attention matrix sparsity. Blue filled cells represent attention being computed for that pair of inputs.
  • Figure 2: Keyword prefixing block diagram. Keywords in red are given global attention (shown as blue rows/columns in Fig. 1).
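
Since the figures themselves are not reproduced here, the sparsity pattern described in the captions can be sketched in code. This is a minimal illustration that assumes a Longformer-style pattern: a sliding local window plus a set of global positions whose rows and columns are fully filled. The function name `attention_pattern` and its parameters are hypothetical.

```python
def attention_pattern(seq_len, window, global_positions):
    """Build a boolean matrix where entry [i][j] is True if query i attends to key j.

    Assumed pattern: each token attends to neighbours within window // 2
    positions on either side; global tokens attend to every token and are
    attended to by every token (filled rows and columns).
    """
    half = window // 2
    is_global = set(global_positions)
    pattern = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            local = abs(i - j) <= half
            row.append(local or i in is_global or j in is_global)
        pattern.append(row)
    return pattern


# Position 0 plays the role of a prefixed keyword with global attention.
pattern = attention_pattern(seq_len=6, window=2, global_positions=[0])
for row in pattern:
    print("".join("#" if cell else "." for cell in row))
```

Each `#` corresponds to a filled cell in the attention matrix; the first row and first column are fully filled because position 0 is global, while the remaining cells form a narrow band around the diagonal.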