ViDoRe Benchmark V2: Raising the Bar for Visual Retrieval
Quentin Macé, António Loison, Manuel Faysse
TL;DR
ViDoRe Benchmark V2 addresses the saturation of earlier visual retrieval benchmarks by introducing challenging, realistic retrieval scenarios: blind contextual querying, long-form and cross-document queries, and a hybrid query-generation pipeline, applied across four multilingual datasets. The design aims to reduce extractive bias and to extend evaluation to multilingual and cross-document settings, with BeIR-compatible tooling and plans to evolve as a living benchmark. Initial results show substantial headroom, especially in non-English generalization and cross-domain tasks. They also indicate that larger models perform better at a higher computational cost, and that human-labeled data provides a more discriminative signal. The benchmark is intended to support real-world visual retrieval research through community-driven dataset growth and ongoing method development.
Abstract
The ViDoRe Benchmark V1 was approaching saturation, with top models exceeding 90% nDCG@5, which limited its ability to discern improvements between models. ViDoRe Benchmark V2 introduces realistic, challenging retrieval scenarios through blind contextual querying, long and cross-document queries, and a hybrid query generation process that combines synthetic generation with human-in-the-loop filtering. It comprises four diverse, multilingual datasets and provides clear evaluation instructions. Initial results demonstrate substantial room for advancement and offer insights into model generalization and multilingual capability. The benchmark is designed as a living resource, and community contributions are invited to keep it relevant for future evaluations.
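The saturation claim above is stated in terms of nDCG@5. As a reminder of what that metric measures, the following is a minimal sketch of one common formulation, assuming linear gain, a log2 rank discount, and a relevance list that covers the full ranking for a query. The function names `dcg_at_k` and `ndcg_at_k` are illustrative and are not taken from the ViDoRe tooling, whose exact metric implementation may differ.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain, log2 discount)."""
    # rank is 0-based, so the discount for the first item is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k for one query.

    ranked_relevances: relevance grades of the retrieved pages, in the order
    the retriever returned them. Assumption: the list covers every relevant
    page for the query, so sorting it yields the ideal ranking.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant page exists for this query
    return dcg_at_k(ranked_relevances, k) / ideal_dcg


if __name__ == "__main__":
    # The single relevant page is ranked first, then second.
    print(ndcg_at_k([1, 0, 0, 0, 0]))
    print(ndcg_at_k([0, 1, 0, 0, 0]))
```

A benchmark-level score under this formulation would be the mean of `ndcg_at_k` over all queries, so a model above 90% places most relevant pages at or near the top of the ranking, which leaves little room to separate stronger models.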
