Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Weiqing Li; Jinyue Guo; Yaqi Wang; Haiyang Xiao; Yuewei Zhang; Guohua Liu; Hao Henry Wang

Abstract

Vision-language models (VLMs) excel at cross-modal data mappings, but the heterogeneity and lack of structure of real-world documents disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%, respectively.
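
For readers unfamiliar with the reported metric, the following is a minimal sketch of how nDCG@5 is commonly computed for a single query. It is not taken from the paper: the function names, the linear gain, and the log2 rank discount are assumptions, and the ideal ranking is approximated by re-sorting the labels of the retrieved list rather than using every relevant document in the corpus. Benchmark toolkits may differ in these details.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (assumed linear gain)."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """nDCG@k for one query.

    `relevances` holds the graded relevance labels of the retrieved
    documents, in the order the retriever ranked them.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal == 0:
        return 0.0  # no relevant document retrieved: score defined as 0 here
    return dcg_at_k(relevances, k) / ideal


if __name__ == "__main__":
    # Hypothetical ranking: the only relevant page is returned second.
    print(ndcg_at_k([0, 1, 0, 0, 0], k=5))
```

A benchmark-level score such as those quoted above would typically be the mean of this per-query value over all queries.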

Paper Structure (42 sections, 6 equations, 6 figures, 10 tables)

Introduction
Related works
Paradigm Shift in Document Retrieval: From Text Parsing to Visual Encoding.
Refining Dense Representations: From Global Matching to Multi-Grained Interaction.
Optimizing Training Strategies for Retrieval.
Methodology
Representation Enhancement via Viewpoint-Pathway Collaboration
Multi-View Alignment (MVA).
Bidirectional Contrastive Learning (BCL).
Overall Optimization Objective.
LLM-Guided Evolutionary Curriculum (LLM-EC).
Offline Candidate Pool Generation.
Online LLM-Guided Curriculum Evolution.
Experiments
Datasets
...and 27 more sections

Figures (6)

Figure 1: Challenges in complex visual document retrieval: (a) Insufficient Spatial Awareness: Models must extract and integrate dispersed information (e.g., combining "Plastic 12%" and "Metal 8%") from dynamic layouts. (b) Intrinsic Vulnerability to Textual Confusion: Models must identify differentiated needs in obfuscated queries (e.g., franchisee info and corporate chart). (c) Stagnation from a Static Curriculum: The learning effectiveness of the initial negative sample set (e.g., containing only "salary") gradually diminishes under a fixed curriculum.
Figure 2: The motivation behind LLM-guided evolutionary curriculum. (a) A static threshold is effective in the early training stage, providing informative gradient signals. (b) As the model converges, the same threshold captures only trivial negative samples, leading to diminished gradients. (c) Our LLM-guided evolutionary curriculum adjusts the mining interval to continually discover challenging negatives, ensuring sustained optimization.
Figure 3: Evo-Retriever enhances representation via multi-view (Evo. 1) and bidirectional contrast (Evo. 2). Guided by expert knowledge, this Viewpoint-Pathway collaboration dynamically mines hard samples (Evo. 3), enabling continual evolution of the model.
Figure 4: Overview of the proposed LLM-EC.
Figure 5: Impact of Temperature $\tau$ on the Normalized Gradient Profile. The plot shows the normalized gradient, $\sigma(\Delta s / \tau)$, as a function of the similarity gap $\Delta s$. A lower $\tau$ (e.g., our setting of 0.02, red line) creates a much steeper and narrower transition zone where gradients become substantial. This highlights the need for a curriculum that can precisely target negatives within this narrow "sweet spot" to ensure efficient learning.
...and 1 more figure
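
The normalized gradient profile described in the Figure 5 caption can be explored numerically. The sketch below evaluates $\sigma(\Delta s / \tau)$ and measures how wide the transition zone is for a given temperature. It assumes that $\Delta s$ is the negative's similarity minus the positive's similarity (the caption does not state the sign convention), and the helper names and the 10% to 90% definition of the "transition zone" are illustrative choices, not the paper's.

```python
import math


def normalized_gradient(delta_s, tau):
    """sigma(delta_s / tau), written to avoid overflow for large |delta_s / tau|."""
    z = delta_s / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def transition_width(tau, lo=0.1, hi=0.9):
    """Width of the delta_s interval over which the gradient rises from lo to hi.

    Uses the inverse of the sigmoid (the logit), so the width scales
    linearly with tau.
    """
    def logit(p):
        return math.log(p / (1.0 - p))

    return tau * (logit(hi) - logit(lo))


if __name__ == "__main__":
    # 0.02 is the setting quoted in the caption; the other values are for contrast.
    for tau in (0.02, 0.05, 0.1):
        print(f"tau={tau}: 10%-90% transition width = {transition_width(tau):.4f}")
    # A negative scoring 0.1 below the positive contributes almost no gradient at tau=0.02.
    print(normalized_gradient(-0.1, 0.02))
```

Because the width is proportional to $\tau$, lowering the temperature shrinks the band of similarity gaps that yields a substantial gradient, which is consistent with the caption's point that negatives must be mined within a narrow interval.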
