Bidirectional Long-Range Parser for Sequential Data Understanding

George Leotescu, Daniel Voinea, Alin-Ionut Popa

TL;DR

The paper tackles the difficulty of applying Transformers to long sequences due to quadratic attention complexity. It introduces Bidirectional Long-Range Parser (BLRP), which combines local-window attention with a global bidirectional latent-space synthesis by splitting the input $\mathbf{X}$ into segments $\mathbf{X}_i$ (each of size $t$) and maintaining a latent block $\mathbf{L} \in \mathbb{R}^{l \times d}$ initialized via a learned projection $\Phi(\mathbf{X})$. A forward and a backward pass update the latent block through interleaved self-attention and cross-attention, producing a final representation $\mathbf{L}^{\mathtt{FINAL}}$ that encodes both proximal and distant dependencies for downstream tasks. BLRP achieves competitive or superior results on the Long-Range-Arena benchmarks (ListOps, Text, Retrieval) and CIFAR with around $2.58 \times 10^5$ parameters, while ablation studies highlight the critical roles of bidirectional flow and dynamic-projection initializations. The method offers a scalable, versatile approach to long-sequence understanding across language and vision and provides a flexible framework for incorporating global context into sequential data processing.
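
To make the scaling argument concrete, the following minimal sketch counts attention-score entries under a deliberately simplified cost model. The model is an assumption made for illustration, not a figure from the paper: full attention is charged $n^2$ entries, while the segmented scheme is charged, per segment and per pass, $t^2$ entries for local attention plus $2tl$ entries for the two cross-attention directions between a segment and the latent block. The function names and the default of two passes (forward and backward) are hypothetical.

```python
import math


def full_attention_entries(n):
    """Score-matrix entries for full self-attention over n tokens."""
    return n * n


def segmented_entries(n, t, l, passes=2):
    """Score entries under the assumed cost model for segment-wise parsing.

    n: sequence length, t: segment size, l: latent block size.
    Each segment is charged t*t (local attention) + 2*t*l (cross-attention
    in both directions); this is an illustrative assumption, not the
    paper's exact accounting.
    """
    num_segments = math.ceil(n / t)
    per_segment = t * t + 2 * t * l
    return passes * num_segments * per_segment


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        print(n, full_attention_entries(n), segmented_entries(n, t=128, l=128))
```

Under this model the segmented cost grows linearly in $n$ for fixed $t$ and $l$, whereas the full-attention cost grows quadratically.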

Abstract

The transformer is a powerful data modelling framework responsible for remarkable performance on a wide range of tasks. However, it is limited in terms of scalability, as it processes long-sequence data suboptimally and inefficiently. To this end, we introduce BLRP (Bidirectional Long-Range Parser), a novel and versatile attention mechanism designed to increase performance and efficiency on long-sequence tasks. It leverages short- and long-range heuristics in the form of a local sliding-window approach combined with a global bidirectional latent-space synthesis technique. We show the benefits and versatility of our approach on vision and language domains by demonstrating competitive results against state-of-the-art methods on the Long-Range-Arena and CIFAR benchmarks, together with ablations demonstrating its computational efficiency.

Paper Structure (7 sections, 6 equations, 4 figures, 5 tables)

This paper contains 7 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Performance comparison for different sequence lengths. We compare our proposed BLRP framework against Didolkar et al. and Zhu et al. (2021) for different sequence lengths on ListOps. BLRP brings scalable performance gains irrespective of sequence length, showing that the bidirectional mechanism increases the model's representational power across all input length ranges. Dotted lines represent average performance for each method.
  • Figure 2: Detailed overview of the proposed BLRP method. The flow is from left to right for the forward pass, followed by right to left for the backward pass. The input sequence $\mathbf{X}$ is split into a list of non-overlapping segments $(\mathbf{X}_i)_{i=1}^T$, which is bidirectionally parsed while capturing the overall information into the latent block $\mathbf{L}$, originally initialized with $\mathbf{L}^\mathtt{INIT}$ via function $\Phi$. In turn, $\mathbf{L}^\mathtt{INIT}$ is added at the start of both forward and backward passes and subsequently used as a residual connection. Thus, we obtain the temporal-level state representations, $(\mathbf{L}_i^\mathtt{F})_{i=1}^T$ and $(\mathbf{L}_i^\mathtt{B})_{i=T}^1$, which synthesise the aggregated information from the entire sequence inside $\mathbf{L}^\mathtt{FINAL}$. Notice that we optimally aggregate information at the spatial level by iteratively conditioning the latent states on the segment embeddings, and at the temporal level by utilizing the corresponding forward segment embeddings to update the backward states.
  • Figure 3: Module $\Theta^{\mathtt{CROSS}}$ at step $i$. Forward embeddings $\mathbf{\tilde{X}}_i^{\mathtt{F}}$ and latent state $\mathbf{L}_{i}^\mathtt{F}$ are obtained by interleaving $\Theta_{L}^\mathtt{CROSS}$ and $\Theta_{X}^\mathtt{CROSS}$.
  • Figure 4: (Left) Impact of latent block size versus segment size. Rows correspond to segment sizes, denoted $t_S$, and columns correspond to latent block sizes, denoted $t_L$. The highest performance (i.e., $41.43$) is obtained when the segment size and the latent block size both equal $100$. The poorest performance is obtained with a segment size of $1$: the window context is so limited that the model cannot infer the global information within the input sequence. Increasing the latent size alone is not enough to reach optimal performance; the local context has to be large enough to capture meaningful correlations. (Right) Scalability analysis in terms of GPU memory usage. We tested scalability in terms of GPU memory consumption for different sequence lengths (i.e., $512$, $1024$, etc.). For Transformer (Vaswani et al., 2017) and Performer (Choromanski et al.), we were unable to test on extremely long sequences due to GPU memory limitations. All measurements were performed on an NVIDIA A10G machine with $24$ GB of memory.
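
The parsing flow described in the captions of Figures 2 and 3 can be sketched in plain Python as below. This is an illustrative sketch under several assumptions that the captions do not settle, not the authors' implementation: attention is single-head with identity projections in place of learned weights, `init_latent` is a hypothetical average-pooling stand-in for $\Phi$, the residual is applied by adding $\mathbf{L}^\mathtt{INIT}$ after every step, the backward pass starts from the last forward state plus $\mathbf{L}^\mathtt{INIT}$, and the last backward state is taken as $\mathbf{L}^\mathtt{FINAL}$. All function names are made up for this example.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Single-head dot-product attention; identity projections (assumption)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, context)) for c in range(d)])
    return out


def add(a, b):
    """Element-wise sum of two equally shaped matrices (lists of rows)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def init_latent(x, l):
    """Hypothetical stand-in for Phi: average-pool x into l rows (needs len(x) >= l)."""
    n, d = len(x), len(x[0])
    rows = []
    for j in range(l):
        chunk = x[j * n // l:(j + 1) * n // l]
        rows.append([sum(r[c] for r in chunk) / len(chunk) for c in range(d)])
    return rows


def cross_step(latent, segment):
    """Interleaved update: the latent reads the segment, then the segment reads the latent."""
    latent = add(latent, attend(latent, segment))    # role of Theta_L^CROSS
    segment = add(segment, attend(segment, latent))  # role of Theta_X^CROSS
    return latent, segment


def blrp_parse(x, t, l):
    segments = [x[i:i + t] for i in range(0, len(x), t)]  # non-overlapping segments
    l_init = init_latent(x, l)

    # Forward pass, left to right, starting from L_INIT.
    latent, forward_states, forward_embs = l_init, [], []
    for seg in segments:
        latent, emb = cross_step(latent, seg)
        latent = add(latent, l_init)  # assumed form of the residual connection
        forward_states.append(latent)
        forward_embs.append(emb)

    # Backward pass, right to left, over the forward segment embeddings.
    latent, backward_states = add(latent, l_init), []
    for emb in reversed(forward_embs):
        latent, _ = cross_step(latent, emb)
        latent = add(latent, l_init)
        backward_states.append(latent)

    # The last backward state plays the role of L_FINAL (assumption).
    return forward_states, backward_states, latent


if __name__ == "__main__":
    n, d = 8, 4
    x = [[(i + c) / 10 for c in range(d)] for i in range(n)]
    fwd, bwd, l_final = blrp_parse(x, t=4, l=2)
    print(len(fwd), len(bwd), len(l_final), len(l_final[0]))
```

The latent block keeps a fixed shape of $l \times d$ regardless of sequence length, so each step only ever attends over one segment and the latent block, which is what keeps memory bounded as the sequence grows.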