Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations

Yibo Yan; Mingdong Ou; Yi Cao; Xin Zou; Shuliang Liu; Jiahao Huo; Yu Huang; James Kwok; Xuming Hu

TL;DR

ColParse is a paradigm that uses a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to form a compact, structure-aware multi-vector representation.

Abstract

Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance gains across numerous benchmarks and base models. ColParse thus bridges the critical gap between the fine-grained accuracy of multi-vector retrieval and the practical demands of large-scale deployment, offering a new path towards efficient and interpretable multimodal information systems.
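The pipeline described in the abstract (a few region embeddings per page, each fused with a page-level vector, then scored by late interaction) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the weighted-sum fusion, the `alpha` mixing weight, the function names, and the hand-made 4-dimensional vectors are all assumptions made for the example. The scoring function is the standard MaxSim late-interaction rule used by multi-vector retrievers.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def fuse_global_local(local_vecs, global_vec, alpha=0.5):
    """Hypothetical fusion: mix each region vector with the page vector.

    local_vecs: one embedding per parsed layout region (sub-image).
    global_vec: embedding of the whole page.
    alpha: assumed mixing weight, not a value taken from the paper.
    """
    g = normalize(global_vec)
    fused = []
    for v in local_vecs:
        u = normalize(v)
        mixed = [(1.0 - alpha) * ui + alpha * gi for ui, gi in zip(u, g)]
        fused.append(normalize(mixed))
    return fused

def maxsim_score(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score: each query vector takes its best
    matching page vector, and the maxima are summed over query vectors."""
    return sum(max(dot(normalize(q), p) for p in page_vecs) for q in query_vecs)

# Toy pages with two "regions" each; the vectors are hand-made, not model outputs.
page_a = fuse_global_local([[1, 0, 0, 0], [0, 1, 0, 0]], [1, 1, 0, 0])
page_b = fuse_global_local([[0, 0, 1, 0], [0, 0, 0, 1]], [0, 0, 1, 1])
query = [[1, 0, 0, 0]]

pages = {"page_a": page_a, "page_b": page_b}
ranking = sorted(pages, key=lambda name: maxsim_score(query, pages[name]), reverse=True)
print(ranking)
```

In this sketch each page stores only as many vectors as it has parsed regions, which is the source of the compactness the abstract refers to; the actual number of regions per page depends on the parser and is not specified here.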

Paper Structure (53 sections, 5 theorems, 17 equations, 10 figures, 4 tables, 4 algorithms)

Introduction
Related Work
Visual Document Retrieval
Multi-Vector Retrieval
Document Parsing VLM
Methodology
Task Formulation
The ColParse Framework
Layout-Informed Document Parsing
Dual-Stream Encoding
Local Encoding.
Global Encoding.
Global-Local Fusion for Final Representation
Late-Interaction Scoring with ColParse
Theoretical Foundation
...and 38 more sections

Key Result

Theorem 2.3

For a set of random variables $\{X_1, \dots, X_n\}$ and another variable $Y$, the chain rule states:

$$I(X_1, \dots, X_n; Y) = \sum_{i=1}^{n} I(X_i; Y \mid X_1, \dots, X_{i-1})$$

This rule is fundamental for decomposing the information content of a complex system.
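As a sanity check on the chain rule for mutual information, the two-variable case, I(X1, X2; Y) = I(X1; Y) + I(X2; Y | X1), can be verified numerically on a small discrete distribution. The sketch below is illustrative only: the joint distribution is an arbitrary toy example, and the helper names `marginal` and `cond_mutual_info` are hypothetical, not taken from the paper.

```python
import math
from collections import defaultdict
from itertools import product

def marginal(joint, keep):
    """Sum a joint pmf {outcome_tuple: prob} down to the coordinates in `keep`."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[tuple(outcome[i] for i in keep)] += p
    return out

def cond_mutual_info(joint, a, b, c=()):
    """I(A; B | C) in bits; a, b, c are tuples of coordinate indices."""
    p_abc = marginal(joint, a + b + c)
    p_ac = marginal(joint, a + c)
    p_bc = marginal(joint, b + c)
    p_c = marginal(joint, c)  # with c == (), this is {(): 1.0}
    total = 0.0
    for key, p in p_abc.items():
        if p == 0.0:
            continue
        ka = key[:len(a)]
        kb = key[len(a):len(a) + len(b)]
        kc = key[len(a) + len(b):]
        total += p * math.log2(p * p_c[kc] / (p_ac[ka + kc] * p_bc[kb + kc]))
    return total

# Arbitrary toy joint pmf over binary (x1, x2, y); the weights are made up.
weights = [1, 2, 3, 4, 5, 6, 7, 8]
outcomes = list(product([0, 1], repeat=3))
joint = {o: w / sum(weights) for o, w in zip(outcomes, weights)}

# Coordinates: 0 -> X1, 1 -> X2, 2 -> Y.
lhs = cond_mutual_info(joint, (0, 1), (2,))
rhs = cond_mutual_info(joint, (0,), (2,)) + cond_mutual_info(joint, (1,), (2,), (0,))
print(math.isclose(lhs, rhs, abs_tol=1e-9))
```

The same helper extends to more variables by adding one conditional term per additional variable, each conditioned on all the variables that precede it.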

Figures (10)

Figure 1: Comparison of natural image retrieval versus VDR.
Figure 2: Illustration of a multi-vector VDR model and three primary optimization strategies for its efficiency bottleneck.
Figure 3: Simplified illustration of the ColParse framework.
Figure 4: Performance comparison (nDCG@5) between ColParse and baselines on five VDR benchmarks across ten mainstream single-vector multimodal retrieval models. Detailed results are given in \ref{tab:full_results_mmlongbench_vidorev1} and \ref{tab:full_results_vidorev2_vidoseek_visrag}, omitted here for space.
Figure 5: Variant study of ColParse.
...and 5 more figures

Theorems & Definitions (12)

Definition 2.1: Mutual Information
Definition 2.2: Conditional Mutual Information
Theorem 2.3: Chain Rule for Mutual Information
Theorem 2.4: Data Processing Inequality (DPI)
proof : Justification
Corollary 2.6: Information Equivalence of Decomposed Representation
proof
Definition 2.7: Contextual Information Gain
Theorem 2.8: Information in the Fused Representation
proof
...and 2 more
