Perceiving Longer Sequences With Bi-Directional Cross-Attention Transformers

Markus Hiller, Krista A. Ehinger, Tom Drummond

Abstract

We present a novel bi-directional Transformer architecture (BiXT) which scales linearly with input size in terms of computational cost and memory consumption, but does not suffer the drop in performance or limitation to only one input modality seen with other efficient Transformer-based approaches. BiXT is inspired by the Perceiver architectures but replaces iterative attention with an efficient bi-directional cross-attention module in which input tokens and latent variables attend to each other simultaneously, leveraging a naturally emerging attention-symmetry between the two. This approach unlocks a key bottleneck experienced by Perceiver-like architectures and enables the processing and interpretation of both semantics ('what') and location ('where') to develop alongside each other over multiple layers -- allowing its direct application to dense and instance-based tasks alike. By combining efficiency with the generality and performance of a full Transformer architecture, BiXT can process longer sequences like point clouds, text or images at higher feature resolutions and achieves competitive performance across a range of tasks like point cloud part segmentation, semantic image segmentation, image classification, hierarchical sequence modeling and document retrieval. Our experiments demonstrate that BiXT models outperform larger competitors by leveraging longer sequences more efficiently on vision tasks like classification and segmentation, and perform on par with full Transformer variants on sequence modeling and document retrieval -- but require $28\%$ fewer FLOPs and are up to $8.4\times$ faster.
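
The core idea stated above, tokens and latents attending to each other simultaneously, can be sketched in a few lines of NumPy. This is a minimal single-head illustration under stated assumptions, not the authors' implementation: the projection names (`ref_lat`, `ref_tok`, `val_lat`, `val_tok`), the scaling by the square root of the head dimension, and the omission of residual connections, normalization and multiple heads are all choices made here for brevity. What it takes from the source is that a single $M\times N$ similarity matrix is normalized row-wise for one direction and column-wise for the other (see the Figure A1 caption), so for a fixed number of latents $M$ the cost grows linearly with the number of tokens $N$.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def bidirectional_cross_attention(latents, tokens, params):
    """Single-head sketch. latents: (M, D); tokens: (N, D).

    `params` holds four hypothetical (D, d) projection matrices;
    the names are illustrative and not taken from the paper.
    """
    ref_lat = latents @ params["ref_lat"]  # (M, d)
    ref_tok = tokens @ params["ref_tok"]   # (N, d)
    val_lat = latents @ params["val_lat"]  # (M, d)
    val_tok = tokens @ params["val_tok"]   # (N, d)

    # One shared similarity matrix: O(M * N * d) time, O(M * N) memory.
    sim = ref_lat @ ref_tok.T / np.sqrt(ref_lat.shape[-1])  # (M, N)

    lat_to_tok = softmax(sim, axis=1)    # (M, N): rows sum to 1 over tokens
    tok_to_lat = softmax(sim, axis=0).T  # (N, M): rows sum to 1 over latents

    new_latents = lat_to_tok @ val_tok   # (M, d): latents gather from tokens
    new_tokens = tok_to_lat @ val_lat    # (N, d): tokens gather from latents
    return new_latents, new_tokens, lat_to_tok, tok_to_lat


rng = np.random.default_rng(0)
M, N, D, d = 4, 32, 16, 8
params = {
    name: rng.normal(size=(D, d))
    for name in ("ref_lat", "ref_tok", "val_lat", "val_tok")
}
latents = rng.normal(size=(M, D))
tokens = rng.normal(size=(N, D))

new_latents, new_tokens, lat_to_tok, tok_to_lat = bidirectional_cross_attention(
    latents, tokens, params
)
print(new_latents.shape, new_tokens.shape)  # (4, 8) (32, 8)
```

Doubling `N` in this sketch doubles the size of `sim`, whereas full self-attention over the tokens would need an $N\times N$ matrix and therefore grow quadratically.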

Paper Structure (43 sections, 3 equations, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Emerging patterns when attending both ways. (a) Input image. (b) depicts the areas of the image that 4 different latents attend to, while (c) inversely shows which image regions attend to these latents (transformed into the same coordinate system for ease of interpretation). (d) displays which areas & latents are symmetrically attended to using our proposed bi-directional cross-attention.
  • Figure 2: BiXT architecture. (left) Input data passing through one layer of our Bi-Directional Cross-Attention Transformer. (right) Internal structure of proposed efficient bi-directional cross-attention.
  • Figure 3: Scaling trends. Ablating the influence of embedding dimension, varying numbers of latents and sequence lengths for ImageNet1K classification. All models were trained with a shorter schedule (only 300 epochs) to save computational resources, so results should be compared relative to each other. Red star-markers correspond to BiXT-Ti/16 (Acc. 80.1) from the ImageNet1K results table. Validation accuracy is shown as solid lines, while dashed lines indicate the computational resources.
  • Figure A1: Degrees of Freedom. (a) Row-wise softmax for uni-directional cross-attention, based on matrix $\in\mathbb{R}^{M\times N}$ with $M\cdot(N-1)$ degrees of freedom. (b) Column-wise softmax for uni-directional cross-attention, based on matrix $\in\mathbb{R}^{M\times N}$ with $N\cdot(M-1)$ degrees of freedom. (c) Row- and column-wise softmax for our proposed bi-directional cross-attention, using the same matrix $\in\mathbb{R}^{M\times N}$ with $MN-1$ degrees of freedom.
  • Figure A2: Transitioning from iterative to bi-directional attention. (a) Perceiver-like iterative attention, creating a bottleneck and small effective working memory; (b) Naïve sequential attention 'unblocking' the bottleneck and extending working memory, but still markedly less efficient than: (c) Bi-directional cross-attention used in BiXT, combining efficient linear scaling with competitive performance across various tasks. Note that iterative attention attends to the (unrefined) input at every layer, while sequential and bi-directional attend to variants of the input refined by the previous layer. The Perceiver-like setup additionally uses multiple self-attention layers to refine between each iterative cross-attention ($\times B$) in each architectural layer, whereas sequential and bi-directional variants only use one self-attention operation per architectural layer. Architectures are then built by stacking $L$ layers.
  • ...and 4 more figures
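
The degree-of-freedom counts in the Figure A1 caption can be checked numerically. The sketch below is one plausible reading of those counts, not something the source spells out: a row-wise softmax ignores a constant added to each row (removing $M$ degrees of freedom), a column-wise softmax ignores a constant added to each column (removing $N$), and applying both to the same matrix leaves only a single global constant that neither can see (removing 1). The sizes `M = 3`, `N = 5` and all variable names are illustrative.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


rng = np.random.default_rng(0)
M, N = 3, 5
sim = rng.normal(size=(M, N))

row_shift = rng.normal(size=(M, 1))  # a different constant for every row
global_shift = 2.5                   # the same constant for every entry

# Row-wise softmax cannot see per-row constants ...
rows_blind_to_row_shift = np.allclose(
    softmax(sim, axis=1), softmax(sim + row_shift, axis=1)
)
# ... but column-wise softmax on the same matrix can.
cols_blind_to_row_shift = np.allclose(
    softmax(sim, axis=0), softmax(sim + row_shift, axis=0)
)
# Only a global constant is invisible to both normalizations.
both_blind_to_global_shift = np.allclose(
    softmax(sim, axis=1), softmax(sim + global_shift, axis=1)
) and np.allclose(softmax(sim, axis=0), softmax(sim + global_shift, axis=0))

dof = {
    "row_wise": M * (N - 1),
    "column_wise": N * (M - 1),
    "bi_directional": M * N - 1,
}
print(rows_blind_to_row_shift, cols_blind_to_row_shift, both_blind_to_global_shift)
print(dof)
```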