Table of Contents
Fetching ...

GlobalMamba: Global Image Serialization for Vision Mamba

Chengkun Wang, Wenzhao Zheng, Jie Zhou, Jiwen Lu

TL;DR

A global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image, and builds a vision mamba model with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences.

Abstract

Vision mambas have demonstrated strong performance with linear complexity to the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten them into 1D sequences for causal processing, which ignore the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of our GlobalMamba, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.

GlobalMamba: Global Image Serialization for Vision Mamba

TL;DR

A global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image, and builds a vision mamba model with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences.

Abstract

Vision mambas have demonstrated strong performance with linear complexity to the number of vision tokens. Their efficiency results from processing image tokens sequentially. However, most existing methods employ patch-based image tokenization and then flatten them into 1D sequences for causal processing, which ignore the intrinsic 2D structural correlations of images. It is also difficult to extract global information by sequential processing of local patches. In this paper, we propose a global image serialization method to transform the image into a sequence of causal tokens, which contain global information of the 2D image. We first convert the image from the spatial domain to the frequency domain using Discrete Cosine Transform (DCT) and then arrange the pixels with corresponding frequency ranges. We further transform each set within the same frequency band back to the spatial domain to obtain a series of images before tokenization. We construct a vision mamba model, GlobalMamba, with a causal input format based on the proposed global image serialization, which can better exploit the causal relations among image sequences. Extensive experiments demonstrate the effectiveness of our GlobalMamba, including image classification on ImageNet-1K, object detection on COCO, and semantic segmentation on ADE20K.

Paper Structure

This paper contains 12 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparisons of different Vision Mamba framerowks. Vim and VMamba adopt a flattening strategy similar to (a) and (b), transmuting two-dimensional images into one-dimensional sequences by row or column, while LocalMamba (c) performs the corresponding flattening within a local window. Notably, these sequences lack the inherent causal sequencing of tokens that is characteristic of the causal architecture of Mamba causal architecture. Differently, GlobalMamba (d) constructs a causal token sequence by frequency, while ensuring that tokens acquire global feature information.
  • Figure 2: The frequency-based global tokenization of GlobalMamba involves frequency-segmenting images into multiple bands, downsampled and tokenized with a lightweight CNN into casual sequences for subsequent processing.
  • Figure 3: The overall framework of GlobalMamba. The causal sequences obtained through global tokenization will undergo iterative feature extraction via multiple Vision Mamba blocks. Each of these blocks is meticulously designed to incorporate layers of normalization, SSM, and MLP. Feature downsampling might be adopted for pyramid architectures such as VMamba.
  • Figure 4: Effect of the causal order: (a) Random division of frequency bands. (b) Dividing the frequency bands in descending order from high to low frequency. (c) Dividing the frequency bands in descending order from low to high frequency. (d) The corresponding classification performances.