
Evo-Retriever: LLM-Guided Curriculum Evolution with Viewpoint-Pathway Collaboration for Multimodal Document Retrieval

Weiqing Li, Jinyue Guo, Yaqi Wang, Haiyang Xiao, Yuewei Zhang, Guohua Liu, Hao Henry Wang

Abstract

Visual-language models (VLMs) excel at data mappings, but the heterogeneity and lack of structure of real-world documents disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, causing cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring an LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from the above collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%, respectively.

Paper Structure (42 sections, 6 equations, 6 figures, 10 tables)

Figures (6)

  • Figure 1: Challenges in complex visual document retrieval: (a) Insufficient Spatial Awareness: Models must extract and integrate dispersed information (e.g., combining "Plastic 12%" and "Metal 8%") from dynamic layouts. (b) Intrinsic Vulnerability to Textual Confusion: Models must identify differentiated needs in obfuscated queries (e.g., franchisee info and corporate chart). (c) Stagnation from a Static Curriculum: The learning effectiveness of the initial negative sample set (e.g., containing only "salary") gradually diminishes under a fixed curriculum.
  • Figure 2: The motivation behind LLM-guided evolutionary curriculum. (a) A static threshold is effective in the early training stage, providing informative gradient signals. (b) As the model converges, the same threshold captures only trivial negative samples, leading to diminished gradients. (c) Our LLM-guided evolutionary curriculum adjusts the mining interval to continually discover challenging negatives, ensuring sustained optimization.
  • Figure 3: Evo-Retriever enhances representation via multi-view (Evo. 1) and bidirectional contrast (Evo. 2). Guided by expert knowledge, this Viewpoint-Pathway collaboration dynamically mines hard samples (Evo. 3), enabling continual evolution of the model.
  • Figure 4: Overview of the proposed LLM-EC.
  • Figure 5: Impact of Temperature $\tau$ on the Normalized Gradient Profile. The plot shows the normalized gradient, $\sigma(\Delta s / \tau)$, as a function of the similarity gap $\Delta s$. A lower $\tau$ (e.g., our setting of 0.02, red line) creates a much steeper and narrower transition zone where gradients become substantial. This highlights the need for a curriculum that can precisely target negatives within this narrow "sweet spot" to ensure efficient learning.
  • ...and 1 more figure
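
The Figure 5 caption describes the normalized gradient $\sigma(\Delta s / \tau)$ and notes that a small temperature narrows the range of similarity gaps where gradients are substantial. The following is a minimal sketch of that relationship, not the authors' code. It assumes $\Delta s$ is the negative's similarity minus the positive's similarity (the caption does not state the sign convention), and the function names, the 10%-90% band used to measure the transition zone, and the comparison temperatures 0.1 and 0.5 are illustrative choices; only $\tau = 0.02$ comes from the caption.

```python
import math


def normalized_gradient(delta_s: float, tau: float) -> float:
    """Return sigma(delta_s / tau), computed in a numerically stable way.

    Assumption: delta_s = s_negative - s_positive, so a larger gap means
    a harder negative and a larger gradient.
    """
    z = delta_s / tau
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def transition_width(tau: float, lo: float = 0.1, hi: float = 0.9) -> float:
    """Width of the delta_s interval over which sigma rises from lo to hi.

    Inverting the sigmoid gives delta_s = tau * logit(p), so the width
    scales linearly with tau.
    """
    def logit(p: float) -> float:
        return math.log(p / (1.0 - p))

    return tau * (logit(hi) - logit(lo))


if __name__ == "__main__":
    # 0.02 is the setting named in the caption; 0.1 and 0.5 are
    # hypothetical values included only for comparison.
    for tau in (0.02, 0.1, 0.5):
        width = transition_width(tau)
        grads = [normalized_gradient(d, tau) for d in (-0.1, -0.05, 0.0, 0.05)]
        print(f"tau={tau}: 10-90% width={width:.4f}, "
              f"gradients={[round(g, 4) for g in grads]}")
```

Because the transition width is proportional to $\tau$, a negative only slightly easier than the band receives almost no gradient at low temperature, which is consistent with the caption's point that the curriculum must target negatives inside this narrow zone.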