OmniMamba: Efficient and Unified Multimodal Understanding and Generation via State Space Models

Jialv Zou, Bencheng Liao, Qian Zhang, Wenyu Liu, Xinggang Wang

TL;DR

OmniMamba presents a linear-architecture approach to unified multimodal understanding and visual generation by building on Mamba-2 and introducing decoupled vocabularies, task-specific LoRA adapters, and a decoupled two-stage training strategy. Trained on only 2M image-text pairs, it achieves competitive results with state-of-the-art models while delivering substantial inference speedups and memory savings compared to Transformer-based baselines. The method demonstrates strong multimodal understanding and MS-COCO visual generation performance, with notable efficiency advantages that lower the barrier for researchers. Limitations include data scale and Mamba-2's sequence length constraints, pointing to future work on scaling data and extending ultra-long sequence capabilities.

Abstract

Recent advancements in unified multimodal understanding and visual generation (or multimodal generation) models have been hindered by their quadratic computational complexity and dependence on large-scale training data. We present OmniMamba, the first linear-architecture-based multimodal generation model that generates both text and images through a unified next-token prediction paradigm. The model fully leverages Mamba-2's high computational and memory efficiency, extending its capabilities from text generation to multimodal generation. To address the data inefficiency of existing unified models, we propose two key innovations: (1) decoupled vocabularies to guide modality-specific generation, and (2) task-specific LoRA for parameter-efficient adaptation. Furthermore, we introduce a decoupled two-stage training strategy to mitigate data imbalance between the two tasks. Equipped with these techniques, OmniMamba achieves competitive performance with JanusFlow while surpassing Show-o across benchmarks, despite being trained on merely 2M image-text pairs, 1,000 times fewer than Show-o. Notably, OmniMamba stands out for its inference efficiency, achieving up to a 119.2× speedup and a 63% GPU memory reduction for long-sequence generation compared to Transformer-based counterparts. Code and models are released at https://github.com/hustvl/OmniMamba.
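To make the idea of decoupled vocabularies concrete, the following is a minimal sketch, not the authors' implementation. It assumes that each modality has its own output head over its own vocabulary, so a text step only scores text tokens and an image step only scores image tokens. All sizes, names (`make_head`, `next_token`), and the random weights are hypothetical and chosen purely for illustration.

```python
import random

# Hypothetical sizes, chosen only for illustration.
HIDDEN = 8
TEXT_VOCAB = 12
IMAGE_VOCAB = 20


def make_head(vocab_size, hidden, rng):
    """Random (vocab_size x hidden) matrix standing in for a trained output head."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(hidden)] for _ in range(vocab_size)]


def logits(head, h):
    """Project a hidden state onto one vocabulary: one score per token id."""
    return [sum(w * x for w, x in zip(row, h)) for row in head]


def next_token(h, heads, modality):
    """Greedy next-token choice restricted to the vocabulary of `modality`."""
    scores = logits(heads[modality], h)
    return max(range(len(scores)), key=scores.__getitem__)


rng = random.Random(0)
# One head per modality: the two vocabularies never share an index space.
heads = {
    "text": make_head(TEXT_VOCAB, HIDDEN, rng),
    "image": make_head(IMAGE_VOCAB, HIDDEN, rng),
}
hidden_state = [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]

text_id = next_token(hidden_state, heads, "text")
image_id = next_token(hidden_state, heads, "image")
print("text token id:", text_id, "| image token id:", image_id)
```

Because the head is selected by modality, an image-generation step cannot emit a text token by construction, which is one plausible reading of how separate vocabularies "guide modality-specific generation".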

Paper Structure

This paper contains 37 sections, 2 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: Comprehensive comparison between OmniMamba and other unified understanding and generation models. (a) Our OmniMamba is trained on only 2M image-text pairs, 1,000 times fewer than Show-o. (b) With such limited training data, our OmniMamba significantly outperforms Show-o across a wide range of benchmarks and achieves competitive performance with JanusFlow. Black metrics are for the multimodal understanding benchmarks, while the blue metric is for the visual generation task. (c)-(d) We compare the speed and memory of OmniMamba with other unified models on the same single NVIDIA 4090 GPU. OmniMamba demonstrates up to a 119.2$\times$ speedup and 63% GPU memory reduction for long-sequence generation.
  • Figure 2: Architecture of the proposed OmniMamba. "MMU" refers to multimodal understanding, while "T2I" refers to text-to-image generation. OmniMamba employs a next-token prediction paradigm for both multimodal understanding and visual generation tasks. To address the distinct requirements of each task (semantic information extraction for multimodal understanding and high-fidelity image compression for visual generation), we utilize separate encoders and heads. Furthermore, we propose decoupled vocabularies to guide modality-specific generation and task-specific LoRA for parameter-efficient adaptation.
  • Figure 3: The Mamba-2 block with task-specific LoRA. It is worth noting that while the Mamba-2 Block in the Mamba-2 paper has two input projectors, the actual code implementation separates the feature dimensions from a single projector output. For simplicity, we depict only one input projector in our illustration. Our task-specific LoRA is applied to this entire input projector.
  • Figure 4: Training strategy of OmniMamba. The trainable components are indicated by a flame symbol, while the frozen ones are represented by snowflakes. The dashed arrows indicate that this route is temporarily dropped and does not participate in model training.
  • Figure 5: Qualitative results of OmniMamba on multimodal understanding and visual generation.
  • ...and 1 more figure
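The Figure 3 caption describes task-specific LoRA applied to a single input projector whose output is then split along the feature dimension. The sketch below illustrates that pattern under stated assumptions; it is not the released code. The dimensions, the split widths in `SPLIT_SIZES`, the scaling `ALPHA / RANK`, and the class name `TaskLoRAProjector` are all hypothetical, and the zero initialization of the up-projection follows common LoRA practice rather than anything stated on this page.

```python
import random

D_MODEL, D_PROJ, RANK = 6, 10, 2   # hypothetical sizes
SPLIT_SIZES = (4, 4, 2)            # hypothetical chunk widths; they sum to D_PROJ
ALPHA = 4.0                        # hypothetical LoRA scaling numerator


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


class TaskLoRAProjector:
    """Shared frozen projector W plus one low-rank update (B @ A) per task."""

    def __init__(self, tasks, rng):
        self.weight = rand_matrix(D_PROJ, D_MODEL, rng)  # shared across tasks
        self.lora = {
            t: {
                "A": rand_matrix(RANK, D_MODEL, rng),        # down-projection
                "B": [[0.0] * RANK for _ in range(D_PROJ)],  # up-projection, zero init
            }
            for t in tasks
        }

    def forward(self, x, task):
        base = matvec(self.weight, x)
        adapter = self.lora[task]  # only the adapter of the active task is used
        delta = matvec(adapter["B"], matvec(adapter["A"], x))
        scale = ALPHA / RANK
        return [b + scale * d for b, d in zip(base, delta)]


def split(v, sizes):
    """Cut one projector output into consecutive feature chunks."""
    out, start = [], 0
    for s in sizes:
        out.append(v[start:start + s])
        start += s
    return out


rng = random.Random(0)
proj = TaskLoRAProjector(("mmu", "t2i"), rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]

y_mmu = proj.forward(x, "mmu")
# Stand-in for training: give only the T2I adapter a non-zero up-projection.
proj.lora["t2i"]["B"] = rand_matrix(D_PROJ, RANK, rng)
y_t2i = proj.forward(x, "t2i")
chunks = split(y_t2i, SPLIT_SIZES)
print([len(c) for c in chunks])
```

Since the low-rank update is added to the whole projector output before the split, every downstream chunk receives a task-dependent adjustment, while the shared weight matrix is used unchanged by both tasks.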