Vision-LSTM: xLSTM as Generic Vision Backbone

Benedikt Alkin, Maximilian Beck, Korbinian Pöppel, Sepp Hochreiter, Johannes Brandstetter

TL;DR

This paper proposes Vision-LSTM (ViL), a generic vision backbone that adapts the xLSTM architecture to process image patch tokens with mLSTM blocks that alternate in traversal direction. ViL scales linearly in compute and memory with sequence length because it replaces self-attention with a matrix-memory mLSTM that exchanges information between patches, while keeping an isotropic design with no downsampling after the patch embedding. With ImageNet-1K pretraining and transfer to ADE20K and VTAB-1K, ViL performs strongly across classification, segmentation, and diverse transfer tasks, outperforming optimized ViTs and Vim on several benchmarks. The work also presents ablations on traversal directions, QK convolution, positional embeddings, and classification design. It discusses current hardware limitations and future directions, and highlights ViL’s potential as an efficient backbone for high-resolution vision tasks once hardware kernels for mLSTM mature.

Abstract

Transformers are widely used as generic backbones in computer vision, despite initially being introduced for natural language processing. Recently, the Long Short-Term Memory (LSTM) has been extended to a scalable and performant architecture, the xLSTM, which overcomes long-standing LSTM limitations via exponential gating and a parallelizable matrix memory structure. In this report, we introduce Vision-LSTM (ViL), an adaptation of the xLSTM building blocks to computer vision. ViL comprises a stack of xLSTM blocks where odd blocks process the sequence of patch tokens from top to bottom while even blocks go from bottom to top. Experiments show that ViL holds promise to be further deployed as a new generic backbone for computer vision architectures.
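
The alternating traversal described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `causal_mean_mixer` is a hypothetical stand-in for an mLSTM layer (a running mean that only passes information forward along the sequence), `alternating_stack` is an illustrative name, and residual connections, projections, and gating are omitted.

```python
from typing import Callable, List

Token = List[float]


def causal_mean_mixer(tokens: List[Token]) -> List[Token]:
    """Hypothetical stand-in for an mLSTM layer (assumption, not the real layer).

    Each output token is the running mean of all tokens seen so far,
    so information only flows forward along the sequence.
    """
    out: List[Token] = []
    running = [0.0] * len(tokens[0])
    for count, tok in enumerate(tokens, start=1):
        running = [r + t for r, t in zip(running, tok)]
        out.append([r / count for r in running])
    return out


def alternating_stack(
    tokens: List[Token],
    num_blocks: int,
    mixer: Callable[[List[Token]], List[Token]] = causal_mean_mixer,
) -> List[Token]:
    """Apply `num_blocks` blocks; even-numbered blocks traverse in reverse."""
    for block_idx in range(1, num_blocks + 1):
        if block_idx % 2 == 0:
            # Even block: flip the sequence, mix, then flip back.
            tokens = mixer(tokens[::-1])[::-1]
        else:
            # Odd block: process the sequence in its original order.
            tokens = mixer(tokens)
    return tokens


# Four one-dimensional "patch tokens" in row-major order.
patch_tokens = [[1.0], [2.0], [3.0], [4.0]]
after_one = alternating_stack(patch_tokens, num_blocks=1)
after_two = alternating_stack(patch_tokens, num_blocks=2)
```

After one block the first token has only seen itself. After the second, reversed block it also depends on every later token, which is why alternating directions gives each patch access to the whole image even though each individual layer is causal.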

Paper Structure

This paper contains 32 sections, 4 equations, 4 figures, 13 tables.

Figures (4)

  • Figure 1: The efficient and scalable design of Vision-LSTM shows strong performance, uses fewer FLOPs than Transformer/Mamba counterparts, and scales linearly to higher resolutions. Performance is averaged over ImageNet accuracy, ADE20K mIoU and VTAB-1K accuracy.
  • Figure 2: Schematic overview of Vision-LSTM (ViL). Following ViT (Dosovitskiy et al., 2021), an input image is split into patches and linearly projected. Then, a learnable vector is added per position to the patches, producing a sequence of patch tokens. This sequence is then processed by alternating mLSTM blocks where even blocks flip the sequence before and after the mLSTM layer. For classification, ViL uses the concatenation of the first and the last patch as input to a linear classification head. ViL is an isotropic architecture, i.e., all blocks have the same input and output dimension and no downsampling layers are used except the initial patch embedding. Projection layers process each patch individually and the mLSTM exchanges information between patches.
  • Figure 3: Performance overview of ImageNet-1K pre-trained models in relation to pre-training compute. ViL shows strong performances across classification (ImageNet-1K), semantic segmentation (ADE20K) and transfer classification (VTAB-1K) tasks.
  • Figure 4: Uni-directional, bi-directional, quad-directional and oct-directional traversal paths. Squares represent individual patch tokens. Traversal starts at the circle and proceeds in the direction of the arrow. When no further patches remain in a row/column, the traversal continues in the next row/column, as indicated by the dashed line.
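
The traversal paths in Figure 4 can be thought of as different orderings of the same patch grid. The sketch below builds four such orderings as lists of patch indices. The function names and the specific choice of four paths (row-wise and column-wise, each forward and reversed) are assumptions made for illustration; the figure itself defines which paths the paper uses.

```python
from typing import Dict, List


def row_major(height: int, width: int) -> List[int]:
    """Walk left to right within a row, then continue with the next row."""
    return [r * width + c for r in range(height) for c in range(width)]


def column_major(height: int, width: int) -> List[int]:
    """Walk top to bottom within a column, then continue with the next column."""
    return [r * width + c for c in range(width) for r in range(height)]


def traversal_paths(height: int, width: int) -> Dict[str, List[int]]:
    """Return four illustrative traversal orders over a height x width patch grid.

    Patch indices refer to the usual row-major flattening of the grid.
    """
    rows = row_major(height, width)
    cols = column_major(height, width)
    return {
        "row_forward": rows,         # start top-left, move along rows
        "row_backward": rows[::-1],  # start bottom-right, move along rows
        "col_forward": cols,         # start top-left, move along columns
        "col_backward": cols[::-1],  # start bottom-right, move along columns
    }


# A 2 x 3 grid whose patches are indexed
#   0 1 2
#   3 4 5
paths = traversal_paths(2, 3)
```

Under this reading, a uni-directional model would use only `row_forward`, a bi-directional model would alternate between `row_forward` and `row_backward`, and models with more directions would cycle through additional orderings across blocks.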