Hierarchical Scheduling for Multi-Vector Image Retrieval

Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Xiang Chen

TL;DR

HiMIR addresses the bottleneck of single-vector image retrieval in multimodal RAG by introducing a hierarchical decomposition with a 1+M+N Mode that aligns sub-queries with image objects across multiple granularities. It systematically exploits redundancy and sparsity via cross-hierarchy consistency, hierarchy-depth skipping, and offline granularity pruning, coupled with an automated configuration framework to adapt to diverse datasets and latency constraints. The approach yields substantial accuracy gains and up to 3.5x computational savings over prior MVR systems, demonstrated on four large-scale image-text datasets with CLIP/Qwen3-8B components. This hierarchical, scheduling-driven paradigm enhances robustness and practicality for real-world multimodal retrieval in RAG-powered LLM applications.

Abstract

Retrieval-augmented generation (RAG) is used in multimodal large language model (MLLM) applications to leverage user-specific data effectively. However, conventional retrieval approaches often suffer from limited accuracy. Recent multi-vector retrieval (MVR) methods improve accuracy by decomposing queries and matching them against segmented images, but they remain sub-optimal in both accuracy and efficiency: they overlook the alignment between the query and varying image objects, and they match against redundant fine-grained image segments. In this work, we present HiMIR, an efficient scheduling framework for image retrieval. First, we introduce a novel hierarchical paradigm that employs multiple intermediate granularities for varying image objects to enhance alignment. Second, we reduce redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to avoid unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically, making the framework practical across diverse scenarios. Our empirical study shows that HiMIR not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
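To make the idea of hierarchical matching with skipped depths more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each image is stored as a list of embedding levels ordered from coarse to fine (for example, one whole-image vector, then M region vectors, then N patch vectors), and it assumes a simple rule under which finer levels are skipped once a coarser level matches a sub-query poorly. The names `score_image` and `skip_threshold`, the cosine-similarity scoring, and the averaging over sub-queries are all hypothetical choices made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_image(sub_queries, levels, skip_threshold):
    """Score one image against decomposed sub-queries.

    levels: non-empty lists of vectors, ordered coarse to fine.
    Returns (mean best similarity, number of similarity computations).
    The skipping rule is an assumption for illustration only.
    """
    total = 0.0
    comparisons = 0
    for query in sub_queries:
        best = -1.0
        for vectors in levels:
            level_best = max(cosine(query, v) for v in vectors)
            comparisons += len(vectors)
            best = max(best, level_best)
            # Assumed depth-skipping rule: a weak match at this level
            # means finer levels are not examined for this sub-query.
            if level_best < skip_threshold:
                break
        total += best
    return total / len(sub_queries), comparisons


# Toy data: 1 whole-image vector, M = 2 regions, N = 4 patches.
sub_queries = [[1.0, 0.0], [0.0, 1.0]]
levels = [
    [[1.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.1], [0.1, 1.0], [-1.0, 0.0], [0.0, -1.0]],
]
full_score, full_cost = score_image(sub_queries, levels, skip_threshold=0.5)
skip_score, skip_cost = score_image(sub_queries, levels, skip_threshold=0.9)
print(full_score, full_cost, skip_score, skip_cost)
```

In this toy setup, a stricter threshold stops at the coarse level and computes fewer similarities at the cost of a lower score, which illustrates the accuracy/computation trade-off that a per-dataset configuration step would have to balance.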

Paper Structure

This paper contains 21 sections, 4 equations, 8 figures, 1 table, 2 algorithms.

Figures (8)

  • Figure 1: Multimodal Image Retrieval
  • Figure 2: Image Retrieval with Hierarchical Decomposition
  • Figure 3: Decomposition Granularity and Accuracy Impact
  • Figure 4: Computing Redundancy Analysis
  • Figure 5: Computation Optimization in HiMIR
  • ...and 3 more figures