Explicit Sparse Transformer: Concentrated Attention Through Explicit Selection

Guangxiang Zhao, Junyang Lin, Zhiyuan Zhang, Xuancheng Ren, Qi Su, Xu Sun

TL;DR

Explicit Sparse Transformer tackles attention dilution by enforcing top-k sparse attention, which concentrates attention on the most contributive context positions. It replaces dense softmax attention with a top-k masked mechanism that applies to both self-attention and context-attention, giving competitive or better accuracy while training and running inference faster than earlier sparse attention such as sparsemax. Across neural machine translation, image captioning, and language modeling, the approach yields BLEU/METEOR/CIDEr gains and clearer alignments in qualitative analyses. The method is a simple sparsification that may also act as a regularizer during training.

Abstract

The self-attention-based Transformer has demonstrated state-of-the-art performance in a number of natural language processing tasks. Self-attention is able to model long-term dependencies, but it may suffer from the extraction of irrelevant information in the context. To tackle the problem, we propose a novel model called Explicit Sparse Transformer. Explicit Sparse Transformer is able to improve the concentration of attention on the global context through an explicit selection of the most relevant segments. Extensive experimental results on a series of natural language processing and computer vision tasks, including neural machine translation, image captioning, and language modeling, all demonstrate the advantages of Explicit Sparse Transformer in model performance. We also show that our proposed sparse attention method achieves comparable or better results than the previous sparse attention method, while significantly reducing training and testing time. For example, the inference speed is twice that of sparsemax in the Transformer model. Code will be available at https://github.com/lancopku/Explicit-Sparse-Transformer
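
The explicit selection described in the abstract can be written out as a masked softmax. The following is a sketch under assumed notation ($Q$, $K$, $V$ for queries, keys, and values, $d$ for the key dimension, $t_i$ for the $k$-th largest score in row $i$); the symbols are chosen for illustration and may differ from those used in the paper's own equations.

$$
P = \frac{QK^{\top}}{\sqrt{d}}, \qquad
\mathcal{M}(P, k)_{ij} =
\begin{cases}
P_{ij} & \text{if } P_{ij} \geq t_i, \\
-\infty & \text{otherwise},
\end{cases}
\qquad
A = \operatorname{softmax}\bigl(\mathcal{M}(P, k)\bigr), \qquad
C = AV
$$

Because $\exp(-\infty) = 0$, every position below the row threshold $t_i$ receives exactly zero probability, and the remaining $k$ positions share the full probability mass.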

Paper Structure

This paper contains 28 sections, 12 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Illustration of self-attention in the models. The orange bars denote the attention scores of our proposed model, while the blue bars denote the attention scores of the vanilla Transformer. The orange line denotes the attention between the target word "tim" and the selected top-$k$ positions in the sequence. In the vanilla Transformer, "tim" assigns non-zero attention scores to many irrelevant words. In the proposed model, keeping only the top-$k$ largest attention scores removes the distraction from irrelevant words, and the attention becomes concentrated.
  • Figure 2: Comparison between the attention of the vanilla Transformer and that of Explicit Sparse Transformer, with an illustration of the attention module of Explicit Sparse Transformer. With the mask based on top-$k$ selection followed by the softmax function, only the most contributive elements are assigned probabilities.
  • Figure 3: Analysis of the value of $k$ on the IWSLT En-Vi and De-En datasets. "inf" denotes the special case of Explicit Sparse Transformer in which all positions may be attended, which is the same as the original Transformer.
  • Figure 4: Attention visualizations of the vanilla Transformer and of Explicit Sparse Transformer, shown as separate subfigures. The red box shows that the attention in the vanilla Transformer is, at most steps, concentrated on the last token of the context.
  • Figure 5: Code for the main idea in PyTorch.
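
The top-$k$ masking and softmax described in the Figure 2 caption can be sketched in a few lines. This is a minimal pure-Python illustration, not the authors' PyTorch code from Figure 5; the function names `topk_sparse_softmax` and `sparse_attention` are hypothetical, and the scaling by the square root of the key dimension is assumed from standard Transformer attention.

```python
import math


def topk_sparse_softmax(scores, k):
    """Softmax over the k largest entries of `scores`; all others get probability 0.

    Entries tied with the k-th largest score are all kept, so slightly more
    than k positions can survive when ties occur.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    k = min(k, len(scores))
    # Threshold: the k-th largest score in this row.
    threshold = sorted(scores, reverse=True)[k - 1]
    # Mask scores below the threshold with -inf so that exp() maps them to 0.
    masked = [s if s >= threshold else -math.inf for s in scores]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_attention(queries, keys, values, k):
    """Scaled dot-product attention where each query attends to its top-k keys only."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d) for key in keys]
        weights = topk_sparse_softmax(scores, k)
        # Weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


weights = topk_sparse_softmax([2.0, 1.0, 0.5, -1.0], k=2)
print(weights)  # only the first two positions receive non-zero probability
```

Setting `k` to the sequence length keeps every position and recovers ordinary dense softmax attention, which corresponds to the "inf" case in Figure 3.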