Vote-in-Context: Turning VLMs into Zero-Shot Rank Fusers

Mohamed Eltahir, Ali Habibullah, Lama Ayash, Tanveer Hussain, Naeemullah Khan

TL;DR

ViC tackles the challenge of fusing multiple heterogeneous retrievers for video retrieval by turning a frozen Vision-Language Model into a zero-shot, list-wise reranker and fuser. It encodes both content evidence and per-list metadata into a single prompt and introduces S-Grid, which compactly represents a video as a grid of frames, optionally paired with subtitles. The method operates in two modes: single-list reranking ($M=1$) and ensemble fusion ($M>1$). It achieves state-of-the-art zero-shot Recall@1 on MSR-VTT and VATEX and substantial gains over traditional fusion baselines. The approach highlights the potential of prompt-based multimodal fusion and scalable, training-free reasoning, with practical considerations around context windows and latency.

Abstract

In retrieval, fusing candidates from heterogeneous retrievers is a long-standing challenge, particularly for complex, multi-modal data such as videos. Typical fusion techniques are training-free, but they rely solely on rank or score signals and disregard the candidates' representations. This work introduces Vote-in-Context (ViC), a generalized, training-free framework that recasts list-wise reranking and fusion as a zero-shot reasoning task for a Vision-Language Model (VLM). The core insight is to serialize both content evidence and retriever metadata directly within the VLM's prompt, allowing the model to adaptively weigh retriever consensus against visual-linguistic content. We demonstrate the generality of this framework by applying it to the challenging domain of cross-modal video retrieval. To this end, we introduce the S-Grid, a compact serialization map that represents each video as an image grid, optionally paired with subtitles, to enable list-wise reasoning over video candidates. ViC is evaluated both as a single-list reranker, where it dramatically improves the precision of individual retrievers, and as an ensemble fuser, where it consistently outperforms strong baselines like CombSUM. Across video retrieval benchmarks including ActivityNet and VATEX, the framework establishes new state-of-the-art zero-shot retrieval performance, demonstrating its effectiveness in handling complex visual and temporal signals alongside text. In zero-shot settings, ViC achieves Recall@1 scores of 87.1% (t2v) / 89.0% (v2t) on MSR-VTT and 99.6% (v2t) on VATEX, gains of up to +40 Recall@1 points over previous state-of-the-art baselines. We present ViC as a simple, reproducible, and highly effective recipe for turning modern VLMs into powerful zero-shot rerankers and fusers. Code and resources are publicly available at: https://github.com/mohammad2012191/ViC
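
For context on the score-based baselines the abstract contrasts with, the following is a minimal sketch of CombSUM, which sums each candidate's per-retriever scores after normalization. The min-max normalization, the function names, and the toy scores are illustrative assumptions, not the configuration used in the paper. Note that the fused score depends only on retriever scores and never on the candidates' content, which is the limitation ViC targets.

```python
from collections import defaultdict


def min_max_normalize(scores):
    """Scale a {candidate: score} dict to [0, 1] (assumed normalization)."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # A constant list carries no ordering information; map all to 1.0.
        return {cand: 1.0 for cand in scores}
    return {cand: (s - lo) / (hi - lo) for cand, s in scores.items()}


def combsum(score_lists):
    """Fuse several {candidate: score} dicts by summing normalized scores."""
    fused = defaultdict(float)
    for scores in score_lists:
        for cand, s in min_max_normalize(scores).items():
            fused[cand] += s
    # Sort by fused score (descending); break ties by candidate id.
    return sorted(fused, key=lambda cand: (-fused[cand], cand))


# Toy scores from two hypothetical retrievers.
retriever_a = {"vid1": 0.9, "vid2": 0.5, "vid3": 0.1}
retriever_b = {"vid2": 0.8, "vid4": 0.6, "vid1": 0.2}
ranking = combsum([retriever_a, retriever_b])
print(ranking)
```

Here `vid2` ranks first because both retrievers score it well, even though neither score reflects what the video actually shows.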

Paper Structure

This paper contains 26 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Left: R@1 for T2V/V2T on MSR-VTT, DiDeMo, VATEX, and ActivityNet versus strong baselines. Right: Qualitative example where multi-retriever outputs are fused and re-ranked (ViC) to obtain the final list.
  • Figure 2: The Vote-in-Context (ViC) framework. A VLM Reranker jointly weighs serialized content ($Q(\cdot)$, $E(\cdot)$) against retriever metadata (rank, multiplicity) encoded in the Candidate Sequence $C(q)$ by the Duplicate-Aware Interleaving step to produce the final ranking $\widehat{R}(q)$.
  • Figure 3: The Vote-in-Context (ViC) framework applied for Text-to-Video (t2v, top) and Video-to-Text (v2t, bottom). The left block shows the initial retrieval stage. The right block (green) shows our ViC framework. The serialization maps ($Q(\cdot)$, $E(\cdot)$) are modality-dependent: S-Grid Sampling is applied to video inputs, while text inputs use the identity.
  • Figure 4: The S-Grid representation.
  • Figure 5: Efficiency vs. Performance Trade-off. Time per query vs. Avg Recall@1 for t2v retrieval over the benchmarks MSR-VTT, DiDeMo and ActivityNet in zero-shot settings. Marker size represents model parameters. The Pareto frontier highlights optimal trade-offs. Latency is measured on a single NVIDIA A100 80GB GPU, averaged over 50 queries for a 1k video retrieval task.
  • ...and 1 more figure
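
The S-Grid idea referenced in the abstract and in Figures 3 and 4 (representing a video as a single image grid of frames) can be sketched as follows. This is only an illustration under stated assumptions: uniform temporal sampling, a square `s` x `s` layout, and the function name `s_grid` are all hypothetical choices and may differ from the paper's implementation. Subtitles are omitted.

```python
import numpy as np


def s_grid(frames, s):
    """Tile s*s sampled frames into one image (hypothetical helper).

    frames: array of shape (T, H, W, C).
    Returns (grid, idx): grid has shape (s*H, s*W, C), filled row by row;
    idx holds the indices of the sampled frames.
    """
    t = frames.shape[0]
    # Assumption: frames are sampled uniformly over the clip.
    idx = np.linspace(0, t - 1, num=s * s).round().astype(int)
    picked = frames[idx]  # (s*s, H, W, C)
    _, h, w, c = picked.shape
    # (row, col, y, x, c) -> (row, y, col, x, c) -> (s*H, s*W, C)
    grid = (
        picked.reshape(s, s, h, w, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(s * h, s * w, c)
    )
    return grid, idx


# Toy clip: 10 frames of size 2x3 with 3 channels; frame i is filled with i.
frames = np.stack([np.full((2, 3, 3), i, dtype=np.uint8) for i in range(10)])
grid, idx = s_grid(frames, s=2)
print(grid.shape, idx.tolist())
```

Packing a video into one image in this way keeps each candidate to a single visual input, which matters when many candidates must share one prompt.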