ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval

Quentin Macé, António Loison, Manuel Faysse

TL;DR

ViDoRe Benchmark V2 addresses saturation in prior visual retrieval benchmarks by introducing challenging, realistic retrieval scenarios including blind contextual querying, long-form and cross-document queries, and a hybrid query-generation pipeline across four multilingual datasets. The approach emphasizes reducing extractive bias and broadening evaluation to multilingual and cross-document contexts, with BeIR-compatible tooling and plans to evolve as a living benchmark. Key findings show substantial headroom for advancement, especially in non-English generalization and cross-domain tasks, and indicate that larger models offer performance gains at higher computational cost, while human-labeled data provides more discriminative signals. The benchmark is positioned to impact real-world visual retrieval research by enabling community-driven dataset growth and ongoing method development.

Abstract

The ViDoRe Benchmark V1 was approaching saturation with top models exceeding 90% nDCG@5, limiting its ability to discern improvements. ViDoRe Benchmark V2 introduces realistic, challenging retrieval scenarios via blind contextual querying, long and cross-document queries, and a hybrid synthetic and human-in-the-loop query generation process. It comprises four diverse, multilingual datasets and provides clear evaluation instructions. Initial results demonstrate substantial room for advancement and highlight insights on model generalization and multilingual capability. This benchmark is designed as a living resource, inviting community contributions to maintain relevance through future evaluations.
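For readers unfamiliar with the metric quoted above, nDCG@5 can be sketched as follows. This is a minimal illustration, not the benchmark's own evaluation code. It assumes the linear-gain form of DCG and assumes the input lists the relevance grade of every judged page for one query, in the order the system ranked them. The function names `dcg_at_k` and `ndcg_at_k` are hypothetical.

```python
import math


def dcg_at_k(relevances, k=5):
    """Discounted cumulative gain over the top-k relevance grades (linear gain)."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k: DCG of the system ranking divided by the DCG of the ideal ranking.

    Assumption: `ranked_relevances` covers all judged pages for the query,
    so sorting it in descending order gives the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A benchmark-level score is the mean of this value over all queries. A model averaging above 0.90 therefore places relevant pages at or near the top for almost every query, which leaves little room to distinguish stronger systems.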

Paper Structure

This paper contains 6 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Performance results across models for V1 and V2. Scores on the two versions are strongly correlated, but top models clearly saturate on V1. Results are in nDCG@5.
  • Figure 2: Performance results across monolingual tasks. ViDoRe v2 leaves substantial room for future improvements, contrasting with ViDoRe v1, which was approaching performance saturation.
  • Figure 3: Performance results across crosslingual tasks. We observe a significant performance gap between models trained exclusively in English using an English-only VLM and those that are not.