DeMansia: Mamba Never Forgets Any Tokens
Ricky Fang
TL;DR
DeMansia addresses the quadratic complexity of self-attention in transformers by combining state space models (Mamba), Vision Mamba (ViM) blocks, and token labeling (LV-ViT) to make image classification more efficient on resource-limited hardware. The architecture uses a four-layer convolutional front-end, bidirectional ViM blocks, and Aux/Class heads trained with token labeling, which lets it process long token sequences efficiently. On ImageNet-1k, DeMansia Tiny attains 79.4% top-1 accuracy with a compact parameter count, performing competitively against larger Transformer-based and CNN baselines, although the auxiliary head adds some overhead. The work suggests a practical, scalable direction for vision models and points to future extensions in segmentation and broader feature-extraction backends.
Abstract
This paper examines the mathematical foundations of transformer architectures, highlighting their limitations, particularly in handling long sequences. We review the prerequisite models Mamba, Vision Mamba (ViM), and LV-ViT, which pave the way for our proposed architecture, DeMansia. DeMansia integrates state space models with token labeling to improve performance on image classification tasks while efficiently addressing the computational challenges posed by traditional transformers. We present the architecture, benchmarks, and comparisons with contemporary models to demonstrate DeMansia's effectiveness. The implementation is available on GitHub at https://github.com/catalpaaa/DeMansia.
