Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

Yibo Yan, Mingdong Ou, Yi Cao, Xin Zou, Jiahao Huo, Shuliang Liu, James Kwok, Xuming Hu

TL;DR

This work introduces Prune-then-Merge, a two-stage framework for efficient multi-vector visual document retrieval. It first prunes low-information patches and then merges the remaining embeddings, summarizing semantic content without the noise-induced feature dilution seen in single-stage methods.

Abstract

Visual Document Retrieval (VDR), which aims to retrieve relevant pages within vast corpora of visually-rich documents, is of significance in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods like pruning and merging address imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that synergizes these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in single-stage methods. Extensive experiments on 29 VDR datasets demonstrate that our framework consistently outperforms existing methods, significantly extending the near-lossless compression range and providing robust performance at high compression ratios.
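
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes per-patch importance scores are already available, uses a hypothetical mean-plus-`k`-standard-deviations threshold for the adaptive pruning step, and uses greedy centroid-based agglomerative clustering for the hierarchical merging step. The function names, the `k` and `merge_factor` parameters, and the MaxSim scorer are all illustrative choices.

```python
import numpy as np


def adaptive_prune(patches, scores, k=0.0):
    """Stage 1 (assumed rule): keep patches scoring above mean + k * std."""
    threshold = scores.mean() + k * scores.std()
    keep = scores > threshold
    if not keep.any():
        # Never return an empty set: fall back to the single best patch.
        keep[int(np.argmax(scores))] = True
    return patches[keep]


def hierarchical_merge(patches, merge_factor):
    """Stage 2 (assumed rule): greedily merge the most similar clusters.

    Stops when about len(patches) / merge_factor clusters remain and
    returns one mean-pooled embedding per cluster.
    """
    target = max(1, len(patches) // merge_factor)
    clusters = [[i] for i in range(len(patches))]
    centroids = [patches[i].astype(float) for i in range(len(patches))]
    while len(clusters) > target:
        c = np.stack(centroids)
        c = c / (np.linalg.norm(c, axis=1, keepdims=True) + 1e-12)
        sim = c @ c.T  # cosine similarity between cluster centroids
        np.fill_diagonal(sim, -np.inf)
        a, b = np.unravel_index(int(np.argmax(sim)), sim.shape)
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] = clusters[a] + clusters[b]
        centroids[a] = patches[clusters[a]].mean(axis=0)
        del clusters[b]
        del centroids[b]
    return np.stack(centroids)


def maxsim(query, doc):
    """Late-interaction score: best-matching doc vector per query vector."""
    return float((query @ doc.T).max(axis=1).sum())


rng = np.random.default_rng(0)
doc_patches = rng.normal(size=(64, 8))   # 64 patch embeddings, dim 8
importance = rng.random(64)              # stand-in importance scores
query_tokens = rng.normal(size=(5, 8))   # 5 query token embeddings

pruned = adaptive_prune(doc_patches, importance, k=0.0)
merged = hierarchical_merge(pruned, merge_factor=4)
print(len(doc_patches), "->", len(pruned), "->", len(merged))
print("MaxSim on compressed page:", maxsim(query_tokens, merged))
```

Running pruning first means the centroids are averaged only over patches that passed the filter, which is the intuition behind avoiding noise-induced dilution. The sketch's own threshold and clustering rule are stand-ins and should be replaced with the paper's definitions where those are available.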

Paper Structure

This paper contains 64 sections, 3 theorems, 9 equations, 18 figures, 20 tables.

Key Result

Theorem D.1

(Information-Preserving Noise Filtering) Let the full patch set $\mathbf{D}$ be a disjoint union of a signal set $\mathbf{D}_{\text{sig}}$ and a noise set $\mathbf{D}_{\text{noi}}$. Let the importance score $I(\mathbf{d}_j)$ be a proxy for the information a patch $\mathbf{d}_j$ provides about the do

Figures (18)

  • Figure 1: Comparison of single-vec vs. multi-vec VDR.
  • Figure 2: Comparison of pruning-based vs. merging-based efficient VDR paradigms.
  • Figure 3: Performance comparison (nDCG@5) between Prune-then-Merge and baselines on ViDoRe-V1 (Faysse et al., 2024) across Jina-v4 (Left), ColQwen2.5 (Middle), and ColNomic (Right). Solid lines denote adaptive methods, whereas dashed lines denote non-adaptive ones; circular nodes represent pruning methods, whereas square nodes represent merging ones.
  • Figure 4: Performance comparison (nDCG@5) between Prune-then-Merge and baselines on JinaVDR (Günther et al., 2025) across Jina-v4 (Left), ColQwen2.5 (Middle), and ColNomic (Right). Solid lines denote adaptive methods, whereas dashed lines denote non-adaptive ones; circular nodes represent pruning methods, whereas square nodes represent merging ones.
  • Figure 5: Performance comparison (nDCG@5) between Prune-then-Merge and baselines on REAL-MM-RAG (Wasserman et al., 2025) across Jina-v4 (Left), ColQwen2.5 (Middle), and ColNomic (Right). Solid lines denote adaptive methods, whereas dashed lines denote non-adaptive ones; circular nodes represent pruning methods, whereas square nodes represent merging ones.
  • ...and 13 more figures

Theorems & Definitions (3)

  • Theorem D.1
  • Theorem D.2
  • Corollary 1: Synergistic Distortion Reduction