Guided Query Refinement: Multimodal Hybrid Retrieval with Test-Time Optimization

Omri Uzan, Asaf Yehudai, Roi Pony, Eyal Shnarch, Ariel Gera

TL;DR

This work tackles the efficiency-performance gap in multimodal visual document retrieval by proposing Guided Query Refinement (GQR), a test-time optimization that refines a primary retriever's query embedding using guidance from a complementary retriever. By operating within the primary retriever's embedding space and computing a KL-based learning signal from a union candidate pool, GQR achieves consistent improvements over base models and surpasses traditional hybrid fusion methods in both accuracy and resource usage. Extensive experiments on ViDoRe benchmarks demonstrate notable latency and memory savings (up to ~14x faster and ~54x less memory) while maintaining strong retrieval performance, effectively expanding the Pareto frontier for multimodal retrieval. The work also analyzes design choices, hyperparameters, and efficiency tradeoffs, providing practical guidance for deploying test-time hybrid retrieval systems.

Abstract

Multimodal encoders have pushed the boundaries of visual document retrieval, matching textual query tokens directly to image patches and achieving state-of-the-art performance on public benchmarks. Recent models relying on this paradigm have massively scaled the sizes of their query and document representations, presenting obstacles to deployment and scalability in real-world pipelines. Furthermore, purely vision-centric approaches may be constrained by the inherent modality gap still exhibited by modern vision-language models. In this work, we connect these challenges to the paradigm of hybrid retrieval, investigating whether a lightweight dense text retriever can enhance a stronger vision-centric model. Existing hybrid methods, which rely on coarse-grained fusion of ranks or scores, fail to exploit the rich interactions within each model's representation space. To address this, we introduce Guided Query Refinement (GQR), a novel test-time optimization method that refines a primary retriever's query embedding using guidance from a complementary retriever's scores. Through extensive experiments on visual document retrieval benchmarks, we demonstrate that GQR allows vision-centric models to match the performance of models with significantly larger representations, while being up to 14x faster and requiring 54x less memory. Our findings show that GQR effectively pushes the Pareto frontier for performance and efficiency in multimodal retrieval. We release our code at https://github.com/IBM/test-time-hybrid-retrieval

Paper Structure

This paper contains 58 sections, 17 equations, 9 figures, 25 tables, 1 algorithm.

Figures (9)

  • Figure 1: Hybrid retrieval methods. Aggregating the outputs of two retrievers is typically done at the level of ranks (§\ref{ssec:ranking}) or scores (§\ref{ssec:scores}). Utilizing the information of both representations effectively and efficiently is difficult to achieve in practice. Here we propose a novel approach, Guided Query Refinement (GQR), which uses similarity scores from a complementary retriever (left) at test time to inform the query representation of a primary retriever (right).
  • Figure 2: Guided Query Refinement (GQR). Stage 1: Two retrievers independently encode the query and retrieve top-$K$ documents, forming a candidate pool. Stage 2: The primary query embedding is iteratively refined ($z^{(t)}$) over $T$ iterations by minimizing the KL divergence between a consensus distribution and the primary distribution.
  • Figure 3: Latency–quality tradeoff in online querying. The $x$ axis is runtime in milliseconds for a single query, on a log scale, and the $y$ axis is the average evaluation score (NDCG@5). Empty squares indicate the primary retriever alone (without applying GQR).
  • Figure 4: Hyperparameter sweep over GQR's learning rate $\alpha$ and optimization steps $T$, averaged over six model pairs on ViDoRe 2.
  • Figure 5: Online latency breakdown of GQR for $T=25$ and $T=50$.
  • ...and 4 more figures
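
The two stages described in the Figure 2 caption can be sketched in a few lines. This is a minimal illustration under several assumptions that do not come from the paper: the primary retriever is reduced to a single query vector scored by dot product (the models discussed here match query tokens to image patches), the consensus distribution is taken to be the equal-weight average of the two softmax distributions, and the update is plain gradient descent with learning rate `lr` (standing in for $\alpha$) over `steps` iterations (standing in for $T$). All function names are hypothetical, and this is not the authors' released implementation.

```python
import math
import random


def softmax(scores, tau=1.0):
    """Turn raw scores into a probability distribution (temperature tau)."""
    m = max(scores)
    exps = [math.exp((s - m) / tau) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(pi * (math.log(pi + eps) - math.log(qi + eps)) for pi, qi in zip(p, q))


def primary_scores(z, doc_embs):
    """Dot-product score of query vector z against each candidate embedding."""
    return [sum(zi * di for zi, di in zip(z, d)) for d in doc_embs]


def guided_query_refinement(z0, doc_embs, comp_scores, lr=0.1, steps=25, tau=1.0):
    """Refine the primary query embedding z0 over the candidate pool.

    doc_embs:    primary-space embeddings of the union candidate pool (Stage 1 output)
    comp_scores: complementary retriever's scores for the same candidates
    """
    z = list(z0)
    q = softmax(comp_scores, tau)  # complementary distribution, fixed
    for _ in range(steps):
        p = softmax(primary_scores(z, doc_embs), tau)  # primary distribution
        # Assumption: consensus is the average of both distributions,
        # held constant while differentiating with respect to z.
        consensus = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
        # Gradient of KL(consensus || p) w.r.t. z: sum_i (p_i - c_i) * d_i / tau
        for k in range(len(z)):
            grad_k = sum((pi - ci) * d[k] for pi, ci, d in zip(p, consensus, doc_embs)) / tau
            z[k] -= lr * grad_k
    return z


if __name__ == "__main__":
    random.seed(0)
    pool = [[random.gauss(0, 1) for _ in range(8)] for _ in range(6)]  # toy candidate pool
    query = [random.gauss(0, 1) for _ in range(8)]                     # toy primary query
    guide = [random.gauss(0, 1) for _ in range(6)]                     # toy complementary scores
    refined = guided_query_refinement(query, pool, guide)
    # Re-rank the pool with the refined query, highest score first.
    scores = primary_scores(refined, pool)
    print(sorted(range(len(pool)), key=lambda i: -scores[i]))
```

Because only the query vector is updated, the document index stays untouched and the final ranking is still produced in the primary retriever's embedding space, which is the property the TL;DR highlights.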