DocMamba: Efficient Document Pre-training with State Space Model

Pengfei Hu; Zhenrong Zhang; Jiefeng Ma; Shuhang Liu; Jun Du; Jianshu Zhang

TL;DR

DocMamba replaces Transformer-based attention with a pure State Space Model, achieving linear-time inference suitable for long, text-dense documents. It introduces Segment-First Bidirectional Scan to convert 2-D document layouts into 1-D sequences and employs a multi-layer bidirectional Mamba encoder that fuses text with 2-D positional cues without 1-D positional embeddings. The approach yields state-of-the-art results on FUNSD, CORD, and SROIE while delivering substantial speed and memory benefits, and demonstrates length extrapolation on HRDoc. These findings highlight the practical potential of SSMs for Visually-rich Document Understanding and offer a lightweight, scalable baseline for long-context document processing.
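The idea of turning a 2-D layout into a 1-D sequence can be illustrated with a minimal sketch. This summary does not spell out the exact ordering rule, so the code below assumes one plausible reading: segments are sorted top-to-bottom and then left-to-right by their top-left corner, the tokens of each segment are kept contiguous, and the bidirectional scan reads the resulting sequence once forward and once reversed. The names `Segment`, `segment_first_order`, and `bidirectional_views` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), assumed pixel coordinates


@dataclass
class Segment:
    """A text segment (e.g. a line or block) with its bounding box. Hypothetical type."""
    box: Box
    tokens: List[str]


def segment_first_order(segments: List[Segment]) -> List[Tuple[str, Box]]:
    """Flatten a 2-D layout into a 1-D token sequence, one whole segment at a time.

    Assumption: segments are ordered by the top-left corner (y0 first, then x0).
    Every token carries its segment's box so 2-D position stays available.
    """
    ordered = sorted(segments, key=lambda s: (s.box[1], s.box[0]))
    return [(tok, seg.box) for seg in ordered for tok in seg.tokens]


def bidirectional_views(sequence: List[Tuple[str, Box]]):
    """Return the forward sequence and its reversal, the two inputs of a bidirectional scan."""
    return sequence, sequence[::-1]


# Toy page: two segments on the first row (given out of order), one on the second row.
segments = [
    Segment((300, 10, 500, 30), ["Invoice", "No."]),
    Segment((10, 10, 200, 30), ["ACME", "Corp"]),
    Segment((10, 50, 200, 70), ["Total", "due"]),
]
forward, backward = bidirectional_views(segment_first_order(segments))
forward_tokens = [tok for tok, _ in forward]
backward_tokens = [tok for tok, _ in backward]
```

Keeping each segment's tokens adjacent is what lets a sequential model see a phrase as one contiguous run, rather than interleaved with tokens from neighbouring columns.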

Abstract

In recent years, visually-rich document understanding has attracted increasing attention. Transformer-based pre-trained models have become the mainstream approach, yielding significant performance gains in this field. However, the quadratic computational complexity of the self-attention mechanism limits their efficiency and their ability to process long documents. In this paper, we present DocMamba, a novel framework based on the state space model. It is designed to reduce computational complexity to linear while preserving global modeling capabilities. To further enhance its effectiveness in document processing, we introduce the Segment-First Bidirectional Scan (SFBS) to capture contiguous semantic information. Experimental results demonstrate that DocMamba achieves new state-of-the-art results on downstream datasets such as FUNSD, CORD, and SROIE, while significantly improving speed and reducing memory usage. Notably, experiments on HRDoc confirm DocMamba's potential for length extrapolation.
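The contrast between quadratic and linear cost can be made concrete with a toy recurrence. The sketch below is not the DocMamba encoder: it uses a scalar state and fixed parameters `a`, `b`, `c` (all assumptions chosen for illustration), whereas Mamba makes its parameters input-dependent. It only shows why a state space scan visits each token once per direction, while self-attention scores every token pair.

```python
from typing import List


def ssm_scan(xs: List[float], a: float = 0.5, b: float = 1.0, c: float = 1.0) -> List[float]:
    """Scalar linear recurrence: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.

    One pass over the sequence, so the cost grows linearly with its length.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs: List[float], **params: float) -> List[float]:
    """Sum of a left-to-right scan and a right-to-left scan (assumed fusion rule)."""
    forward = ssm_scan(xs, **params)
    backward = ssm_scan(xs[::-1], **params)[::-1]
    return [f + g for f, g in zip(forward, backward)]


def scan_steps(n: int) -> int:
    """Recurrence updates for a bidirectional scan over n tokens."""
    return 2 * n


def attention_pairs(n: int) -> int:
    """Token pairs scored by full self-attention over n tokens."""
    return n * n


outputs = bidirectional_scan([1.0, 0.0, 0.0])
ratio_at_4096 = attention_pairs(4096) // scan_steps(4096)
```

Doubling the sequence length doubles `scan_steps` but quadruples `attention_pairs`, which is the gap the abstract refers to for long documents.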

Paper Structure (16 sections, 8 equations, 8 figures, 3 tables)

Introduction
Related Work
Visually-rich Document Understanding
State Space Models
Preliminaries
Method
Segment-First Bidirectional Scan
Model Architecture
Pre-training Strategy
Experiments
Datasets
Implementation Details
Comparison With State-of-the-Art Methods
Ablation Study
Limitation
...and 1 more section

Figures (8)

Figure 1: Performance and efficiency comparisons between LayoutLMv3 and our DocMamba.
Figure 2: Framework of DocMamba (left) and Bidirectional Mamba Encoder (right).
Figure 3: Depiction of Segment-First Bidirectional Scan.
Figure 4: Comparison of GPU memory usage between LayoutLMv3 and DocMamba.
Figure 5: Comparison of Frames Per Second (FPS) between LayoutLMv3 and DocMamba.
...and 3 more figures
