S$^3$Attention: Improving Long Sequence Attention with Smoothed Skeleton Sketching

Xue Wang; Tian Zhou; Jianqing Zhu; Jialin Liu; Kun Yuan; Tao Yao; Wotao Yin; Rong Jin; HanQin Cai

TL;DR

A smoothed skeleton sketching based Attention structure, coined S$^3$Attention, is proposed; it significantly improves upon previous attempts to negotiate the trade-off between information preservation and computation reduction.

Abstract

Attention-based models have achieved remarkable breakthroughs in numerous applications. However, the quadratic complexity of Attention makes vanilla Attention-based models hard to apply to long-sequence tasks. Various improved Attention structures have been proposed to reduce the computation cost by inducing low-rankness and approximating the whole sequence by sub-sequences. The most challenging part of those approaches is maintaining the proper balance between information preservation and computation reduction: the longer the sub-sequences used, the better the information is preserved, but at the price of introducing more noise and computational cost. In this paper, we propose a smoothed skeleton sketching based Attention structure, coined S$^3$Attention, which significantly improves upon previous attempts to negotiate this trade-off. S$^3$Attention has two mechanisms to effectively minimize the impact of noise while keeping complexity linear in the sequence length: a smoothing block to mix information over long sequences and a matrix sketching method that simultaneously selects columns and rows from the input matrix. We verify the effectiveness of S$^3$Attention both theoretically and empirically. Extensive studies over Long Range Arena (LRA) datasets and six time-series forecasting tasks show that S$^3$Attention significantly outperforms both vanilla Attention and other state-of-the-art variants of Attention structures.
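To make the cost argument in the abstract concrete, the following minimal NumPy sketch contrasts vanilla Attention, which forms an $n \times n$ score matrix, with attention restricted to a sub-sequence of $s$ tokens, which forms an $n \times s$ score matrix. It is only an illustration of the general sub-sequence idea, not the S$^3$Attention method itself; the function names, the uniform choice of `idx`, and the sizes `n`, `d`, `s` are assumptions made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def vanilla_attention(Q, K, V):
    # Score matrix is n x n, so time and memory grow quadratically in n.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V

def subsampled_attention(Q, K, V, idx):
    # Illustrative simplification (not the paper's method): every query
    # attends only to the s tokens listed in idx.
    # Score matrix is n x s, so cost is linear in n for a fixed s.
    Ks, Vs = K[idx], V[idx]
    scores = Q @ Ks.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ Vs

rng = np.random.default_rng(0)
n, d, s = 512, 16, 32  # hypothetical sizes for illustration
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
idx = rng.choice(n, size=s, replace=False)

full = vanilla_attention(Q, K, V)
approx = subsampled_attention(Q, K, V, idx)
```

A larger `s` keeps more of the sequence at higher cost, and `s = n` recovers vanilla Attention exactly. This is the information-versus-computation trade-off the abstract refers to.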

Paper Structure (20 sections, 3 theorems, 31 equations, 3 figures, 15 tables, 4 algorithms)

Introduction
Related Work
S$^3$Attention
Skeleton Sketching Based Attention
Smoother Component
Experiments
Long-Range Arena
Long-Term Forecasting Tasks for Time Series
Transfer Learning in GLUE Tasks
Training Speed and Peak Memory Usage
Robustness Analysis
Model Parameters Impact
Learning Curve for LRA Experiments
Ablation Study
Concluding Remarks
...and 5 more sections

Key Result

Proposition 1

Let $\bm{X}\in\mathbb{R}^{n\times d}$ be a rank-$r$, $\mu$-incoherent matrix. Without loss of generality, we assume $n\geq d$. Let $\bm{E}\in\mathbb{R}^{n\times d}$ be a noise matrix. By uniformly sampling $\mathcal{O}(\mu r \log n)$ columns and rows from the noisy $\bm{X}+\bm{E}$, Skeleton approxim
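As a rough illustration of the setting in Proposition 1, the sketch below applies the standard skeleton (CUR-style) approximation $\bm{X} \approx \bm{C}\,\bm{U}^{\dagger}\bm{R}$ to a low-rank matrix, where $\bm{C}$ holds uniformly sampled columns, $\bm{R}$ holds uniformly sampled rows, and $\bm{U}$ is their intersection. This is the textbook construction and not necessarily the exact variant analyzed in the paper; the helper name `skeleton_approx`, the sample count `k`, the test matrix, and the pseudo-inverse cutoff `rcond` are all assumptions chosen for this example.

```python
import numpy as np

def skeleton_approx(A, row_idx, col_idx, rcond=1e-10):
    # Hypothetical helper: textbook skeleton approximation C @ pinv(U) @ R.
    C = A[:, col_idx]                 # sampled columns, shape (n, k)
    R = A[row_idx, :]                 # sampled rows, shape (k, d)
    U = A[np.ix_(row_idx, col_idx)]   # intersection block, shape (k, k)
    # rcond discards near-zero singular values of U before inverting.
    return C @ np.linalg.pinv(U, rcond=rcond) @ R

rng = np.random.default_rng(0)
n, d, r = 200, 100, 5                 # illustrative sizes with n >= d
# Random Gaussian factors give an exactly rank-r matrix.
X = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))

k = 20                                # illustrative number of samples
rows = rng.choice(n, size=k, replace=False)
cols = rng.choice(d, size=k, replace=False)

X_hat = skeleton_approx(X, rows, cols)
rel_err = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
```

In this noiseless example the intersection block has rank $r$, so the reconstruction is exact up to floating-point error. When a noise matrix $\bm{E}$ is added, small singular values of the intersection block can amplify the error, which is why the cutoff used in the pseudo-inverse matters in practice.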

Figures (3)

Figure 1: Illustration of the architecture of S$^3$Attention.
Figure 2: Illustration on the effect of the Smoother in Skeleton Attention on Token Matrix.
Figure 3: Learning Curve for LRA experiments.

Theorems & Definitions (7)

Definition 1: $\mu$-incoherence
Proposition 1
proof
Proposition 2
proof
Proposition 3
proof
