
DiSAN: Directional Self-Attention Network for RNN/CNN-Free Language Understanding

Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, Chengqi Zhang

TL;DR

This paper introduces DiSAN, a lightweight, RNN/CNN-free network for sentence encoding built from directional self-attention (DiSA) with temporal-order masks and a multi-dimensional (feature-wise) attention mechanism. Forward and backward DiSA blocks produce context-aware token representations that encode temporal order, after which a multi-dimensional source2token attention compresses them into a fixed-size sentence vector. Empirical results on SNLI, SST, MultiNLI, SICK, and several sentence-classification tasks show state-of-the-art or near state-of-the-art accuracy, with fewer parameters and faster computation than recurrent or tree-based models. The results suggest that attention-only architectures can capture order information without recurrent or convolutional layers, making them a practical alternative for scalable NLP systems.

Abstract

Recurrent neural nets (RNN) and convolutional neural nets (CNN) are widely used on NLP tasks to capture the long-term and local dependencies, respectively. Attention mechanisms have recently attracted enormous interest due to their highly parallelizable computation, significantly less training time, and flexibility in modeling dependencies. We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multi-dimensional (i.e., feature-wise). A light-weight neural net, "Directional Self-Attention Network (DiSAN)", is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation. Despite its simple form, DiSAN outperforms complicated RNN models on both prediction quality and time efficiency. It achieves the best test accuracy among all sentence encoding methods and improves the most recent best result by 1.02% on the Stanford Natural Language Inference (SNLI) dataset, and shows state-of-the-art test accuracy on the Stanford Sentiment Treebank (SST), Multi-Genre natural language inference (MultiNLI), Sentences Involving Compositional Knowledge (SICK), Customer Review, MPQA, TREC question-type classification and Subjectivity (SUBJ) datasets.
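To make the "multi-dimensional (i.e., feature-wise)" idea concrete, here is a minimal sketch of the pooling step that compresses a sequence into one vector. It assumes the alignment scores are already computed as one vector per token; the function name `multi_dim_attention` and the plain nested-list representation are illustrative choices, not taken from the paper. The paper computes the scores with a learned network, which is omitted here.

```python
import math

def multi_dim_attention(tokens, scores):
    """Pool n token vectors of size d into a single vector of size d.

    tokens: n x d nested list of token features.
    scores: n x d nested list of alignment scores (one vector per token).
    Assumption: scores are given; a learned scoring function is not modeled.
    """
    n, d = len(tokens), len(tokens[0])
    pooled = []
    for k in range(d):
        # Softmax over the n tokens, computed separately for feature k.
        column = [scores[i][k] for i in range(n)]
        shift = max(column)  # subtract the max for numerical stability
        exps = [math.exp(z - shift) for z in column]
        total = sum(exps)
        # Weighted sum of feature k across tokens.
        pooled.append(sum(e / total * tokens[i][k] for i, e in enumerate(exps)))
    return pooled

# Hypothetical toy input: two tokens with two features each.
example_tokens = [[1.0, 10.0], [3.0, 20.0]]
example_scores = [[0.0, 0.0], [0.0, 100.0]]
example_vector = multi_dim_attention(example_tokens, example_scores)
```

In the toy input, feature 0 has equal scores, so the two tokens are averaged, while feature 1 is dominated by the second token. With scalar attention, a token would receive one weight shared across all of its features, so this per-feature difference could not arise.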

Paper Structure

This paper contains 20 sections, 17 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: (a) Traditional (additive/multiplicative) attention and (b) multi-dimensional attention. $z_i$ denotes alignment score $f(x_i, q)$, which is a scalar in (a) but a vector in (b).
  • Figure 2: Directional self-attention (DiSA) mechanism. Here, we use $l_{i, j}$ to denote $f(h_i, h_j)$ in the masked token2token self-attention equation.
  • Figure 3: Three positional masks: (a) is the diag-disabled mask $M^{diag}$; (b) and (c) are forward mask $M^{fw}$ and backward mask $M^{bw}$, respectively.
  • Figure 4: Directional self-attention network (DiSAN)
  • Figure 5: Fine-grained sentiment analysis accuracy vs. sentence length. The results of LSTM, Bi-LSTM and Tree-LSTM are from ? ( ?) and the result of DiSAN is the average over five random trials.
  • ...and 3 more figures
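
The three positional masks named in Figure 3 can be sketched as small matrices. This is a hedged illustration under stated assumptions: an entry of 0 leaves an alignment score unchanged and an entry of negative infinity blocks it once the mask is added before the softmax; the forward mask is assumed to permit pairs with i < j and the backward mask pairs with i > j. The function name `positional_masks` is hypothetical.

```python
NEG_INF = float("-inf")

def positional_masks(n):
    """Return (diag, fw, bw) as n x n nested lists.

    Assumed convention: 0.0 permits attention between positions i and j,
    -inf disables it when the mask is added to the alignment scores.
    """
    # Diag-disabled mask: blocks only each token's attention to itself.
    diag = [[NEG_INF if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Forward mask (assumed): permits only pairs with i < j.
    fw = [[0.0 if i < j else NEG_INF for j in range(n)] for i in range(n)]
    # Backward mask (assumed): permits only pairs with i > j.
    bw = [[0.0 if i > j else NEG_INF for j in range(n)] for i in range(n)]
    return diag, fw, bw

diag_mask, fw_mask, bw_mask = positional_masks(3)
```

Under these assumptions the forward and backward masks are complementary off the diagonal: every pair with i ≠ j is permitted by exactly one of them, and both block the diagonal.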