DeMansia: Mamba Never Forgets Any Tokens

Ricky Fang

DeMansia: Mamba Never Forgets Any Tokens

Ricky Fang

TL;DR

DeMansia addresses the quadratic complexity of self-attention in transformers by integrating state-space models (Mamba) with vision-specific blocks (ViM) and token-labeling (LV-ViT) to improve image classification efficiency on resource-limited hardware. The architecture uses a four-layer convolutional front-end, bidirectional ViM blocks, and Aux/Class heads trained with token labeling, achieving efficient long-context processing and competitive accuracy. On ImageNet-1k, DeMansia Tiny attains 79.4% top-1 accuracy with a compact parameter count, demonstrating competitive performance against larger Transformer-based and CNN baselines while incurring extra overhead from the auxiliary head. The work suggests a practical, scalable direction for vision transformers and points to future extensions in segmentation and broader feature-extraction backends.

Abstract

This paper examines the mathematical foundations of transformer architectures, highlighting their limitations particularly in handling long sequences. We explore prerequisite models such as Mamba, Vision Mamba (ViM), and LV-ViT that pave the way for our proposed architecture, DeMansia. DeMansia integrates state space models with token labeling techniques to enhance performance in image classification tasks, efficiently addressing the computational challenges posed by traditional transformers. The architecture, benchmark, and comparisons with contemporary models demonstrate DeMansia's effectiveness. The implementation of this paper is available on GitHub at https://github.com/catalpaaa/DeMansia

DeMansia: Mamba Never Forgets Any Tokens

TL;DR

Abstract

Paper Structure (16 sections, 14 equations, 5 figures, 2 tables)

This paper contains 16 sections, 14 equations, 5 figures, 2 tables.

Introduction
Background
Single-Headed Attention
Multi-Head Attention
Related work
Mamba
ViM
LV-ViT
The DeMansia model
Experiments
Experiments Setup
Result
Future Works
Conclusions
Manual for DeMansia's Source Code
...and 1 more sections

Figures (5)

Figure 1: Transformer architecture proposed by attention.
Figure 2: Mamba block proposed by mamba. The Trapezoid is a linear projection layer. $\sigma$ is SiLU / Swish activation GELUswish. $\times$ is multiplicative gate.
Figure 3: ViM block proposed by vim. The Trapezoid is a linear projection layer.
Figure 4: ViM architecture proposed by vim.
Figure 5: Pipeline of training vision transformers with token labeling. lv-vit

DeMansia: Mamba Never Forgets Any Tokens

TL;DR

Abstract

DeMansia: Mamba Never Forgets Any Tokens

Authors

TL;DR

Abstract

Table of Contents

Figures (5)