Separators in Enhancing Autoregressive Pretraining for Vision Mamba

Hanpeng Liu, Zidan Wang, Shuoxi Zhang, Kaiyuan Gao, Kun He

TL;DR

An autoregressive pretraining method for Vision Mamba that substantially extends the input sequence length. It introduces STAR (SeparaTors for AutoRegressive pretraining): separators that mark the boundaries between different images in the input sequence.

Abstract

The state space model Mamba has recently emerged as a promising paradigm in computer vision, attracting significant attention due to its efficient processing of long sequences. Mamba's inherent causal mechanism makes it particularly suitable for autoregressive pretraining. However, current autoregressive pretraining methods are constrained to short sequences and fail to fully exploit Mamba's strength in handling extended sequences. To address this limitation, we introduce an autoregressive pretraining method for Vision Mamba that substantially extends the input sequence length. We introduce new SeparaTors for AutoRegressive pretraining (STAR) to demarcate and differentiate between different images. Specifically, we insert an identical separator before each image to mark where it begins. This strategy enables us to quadruple the input sequence length of Vision Mamba while preserving the original dimensions of the dataset images. Using this long-sequence pretraining technique, our STAR-B model achieves an accuracy of 83.5% on ImageNet-1k, which is highly competitive among Vision Mamba models. These results underscore the potential of our method for improving vision models through better use of long-range dependencies.
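
As a rough illustration of the packing idea in the abstract, the sketch below places the same separator before each image's token sequence and concatenates the results into one long sequence. The function name `pack_with_separators`, the string placeholder tokens, the token counts, and the choice of four images per sequence (one reading of "quadruple") are illustrative assumptions, not details confirmed by the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pack_with_separators(images: Sequence[Sequence[T]],
                         separator: Sequence[T]) -> List[T]:
    """Concatenate per-image token sequences into one long sequence,
    inserting the same separator before every image."""
    packed: List[T] = []
    for tokens in images:
        packed.extend(separator)  # identical separator marks the start of an image
        packed.extend(tokens)     # the image's own tokens keep their original order
    return packed


# Hypothetical sizes: 4 images, 196 tokens per image, 16 separator tokens.
tokens_per_image = 196
separator = ["SEP"] * 16
images = [[f"img{i}_tok{j}" for j in range(tokens_per_image)] for i in range(4)]

packed = pack_with_separators(images, separator)
print(len(packed))  # 4 * (16 + 196) = 848
```

Each image keeps its original token count; only the number of images per sequence, plus the separator overhead, increases the sequence length.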

Paper Structure (25 sections, 8 equations, 4 figures, 7 tables)

Figures (4)

  • Figure 1: Each image is divided into non-overlapping patches. Multiple spatially adjacent patches are grouped into a cluster. STAR adds a separator at the beginning of each image. This separator is also a cluster: the patches on its diagonal are the vector $\mathbf{1}$, and the patches at all other positions are the vector $\mathbf{0}$.
  • Figure 2: Our STAR architecture. During pretraining, we divide images into patches and group spatially adjacent patches into clusters. We then reorder the patches using a cluster-priority scanning method. Our STAR places a separator before the pixel patch sequence and merges multiple image sequences into a long sequence. This long sequence is fed into a MambaMLP with causal properties. The resulting feature tokens are input to a decoder with causal attention. The attention map of the decoder is bidirectional within clusters and unidirectional between clusters. This allows our STAR to treat clusters as the basic units for autoregressive prediction. The prediction target for the last cluster of one image is the separator cluster of the subsequent image.
  • Figure 3: Different values of the separator. Zeros: all tokens in the cluster are zero vectors. Ones: all tokens in the cluster are one vectors. Embeddings: all tokens in the cluster are nn.embedding(0). Identity: tokens on the diagonal of the cluster are one vectors, while other tokens are zero vectors.
  • Figure 4: Different positions of the separator. SC: the separator is placed before the cluster. CS: the separator is placed after the cluster. SCS: separator and cluster appear alternately, with the initial separator placed before the cluster. CSC: separator and cluster appear alternately, with the initial separator placed after the cluster.
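
The captions for Figures 1 to 3 describe an "Identity" separator cluster and a decoder attention pattern that is bidirectional within clusters and unidirectional between clusters. The following is a minimal sketch of both, assuming square clusters, row-major token order inside a cluster, and a sequence laid out cluster by cluster. The helper names `identity_separator` and `cluster_causal_mask` and all sizes are hypothetical, and the paper's actual implementation may differ.

```python
from typing import List


def identity_separator(cluster_side: int, dim: int) -> List[List[float]]:
    """Build a separator cluster of cluster_side x cluster_side tokens (row-major).
    Tokens on the cluster's diagonal are all-ones vectors; the rest are all-zeros."""
    tokens: List[List[float]] = []
    for row in range(cluster_side):
        for col in range(cluster_side):
            value = 1.0 if row == col else 0.0
            tokens.append([value] * dim)
    return tokens


def cluster_causal_mask(num_clusters: int, cluster_size: int) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k.
    A token sees every token in its own cluster (bidirectional within a cluster)
    and in earlier clusters, but none in later clusters (causal between clusters)."""
    n = num_clusters * cluster_size
    cluster_of = [i // cluster_size for i in range(n)]  # assumes cluster-by-cluster order
    return [[cluster_of[k] <= cluster_of[q] for k in range(n)] for q in range(n)]


# Tiny hypothetical example: a 2x2 separator with 3-dimensional tokens.
sep = identity_separator(cluster_side=2, dim=3)
for token in sep:
    print(token)

# Two clusters of two tokens each: rows 0-1 see only cluster 0, rows 2-3 see both.
mask = cluster_causal_mask(num_clusters=2, cluster_size=2)
for row in mask:
    print([int(allowed) for allowed in row])
```

Under these assumptions, each cluster acts as a single autoregressive step: tokens inside a cluster share context with one another, while information flows only forward from one cluster to the next.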