Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

Martin Asenov, Kenza Benkirane, Dan Goldwater, Aneiss Ghodsi

TL;DR

This work shows that BM25, when given better document transcription and preprocessing, can close large gaps on multilingual and visually rich benchmarks. It calls for decomposed evaluation benchmarks that measure transcription and retrieval separately, so that the field can correctly attribute progress and focus effort where it matters.

Abstract

Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval systems relied on lexical methods such as BM25, which rank documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches, especially for multilingual documents with complex visual layouts. We demonstrate that better document representation, rather than the retrieval mechanism itself, is the primary driver of these benchmark improvements. By systematically varying transcription and preprocessing methods while holding the retrieval mechanism fixed, we show that BM25 can close large gaps on multilingual and visual benchmarks. Our findings call for decomposed evaluation benchmarks that measure transcription and retrieval capabilities separately, enabling the field to correctly attribute progress and focus effort where it matters.
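The idea of holding the retriever fixed while varying the document representation can be illustrated with a minimal sketch. The code below is not the authors' pipeline: it is a plain Okapi BM25 scorer with common default parameters (k1 = 1.5, b = 0.75, both assumptions), a hypothetical whitespace tokenizer, and made-up toy documents in which one transcription contains OCR-style character errors.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query (Okapi BM25)."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(d) for d in docs_tokens) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs_tokens for term in set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Corpus-level weighting: rarer terms get a larger IDF.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def tokenize(text):
    """Hypothetical preprocessing step: lowercase and split on whitespace."""
    return text.lower().split()


# Toy data (invented for illustration): the same two pages, transcribed twice.
clean = ["invoice total due in march", "quarterly revenue chart by region"]
noisy = ["inv0ice t0tal due in rnarch", "quarterly revenue chart by region"]

query = tokenize("invoice total")
clean_scores = bm25_scores(query, [tokenize(d) for d in clean])
noisy_scores = bm25_scores(query, [tokenize(d) for d in noisy])

print("clean transcription:", clean_scores)
print("noisy transcription:", noisy_scores)
```

With the clean transcription the first page receives a positive score, while with the noisy transcription neither page matches the query at all, even though the retriever is identical in both runs. In this toy setting, the entire difference comes from the representation step.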

Paper Structure (23 sections, 4 figures, 4 tables)

This paper contains 23 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Multilingual benchmark results across 15 languages. Average Top-5 retrieval accuracy for different methods. Methods are sorted by performance (highest to lowest). Lexical retrievers (BM25) are shown with diagonal hatching. BM25+OCR indexes text produced by state-of-the-art OCR models with language-specific preprocessing. Release years are shown in parentheses.
  • Figure 2: Language-specific BM25 optimization. Top-5 retrieval accuracy for BM25 under different OCR/transcription pipelines (Adobe, EasyOCR, Ministral 3B, Mistral OCR 3) and language-specific text processing (lemmatization, stemming, morphological analysis, and segmentation).
  • Figure 3: OCR impact on text retrieval methods. (a) BM25 retrieval performance for different Top-K values. (b) Retrieval performance for different combinations of OCR transcription models and text retrieval models.
  • Figure 4: Figure-heavy focused QA benchmark results. Average Top-5 retrieval accuracy for different methods. Methods are sorted by performance (highest to lowest). Lexical retrievers (BM25) are shown with diagonal hatching. BM25+OCR uses a small vision-language model (Ministral) for transcription. Release years are shown in parentheses.
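
The captions above report Top-5 retrieval accuracy. The exact definition is not given in this excerpt; a common reading, assumed here, is the fraction of queries for which at least one relevant document appears among the top k ranked results. A minimal sketch under that assumption, with hypothetical function and variable names and invented document IDs:

```python
def top_k_accuracy(ranked_ids_per_query, relevant_ids_per_query, k=5):
    """Fraction of queries with at least one relevant document in the top k.

    ranked_ids_per_query: list of ranked document-ID lists, best first.
    relevant_ids_per_query: list of sets of relevant document IDs.
    """
    hits = 0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        # Count the query as a hit if any of its top-k results is relevant.
        if any(doc_id in relevant for doc_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_ids_per_query)


# Toy example (invented IDs): two queries, each with one relevant page.
rankings = [["d3", "d1", "d7"], ["d2", "d9", "d4"]]
relevant = [{"d1"}, {"d5"}]

top5 = top_k_accuracy(rankings, relevant, k=5)
top1 = top_k_accuracy(rankings, relevant, k=1)
print(top5, top1)
```

Here the first query finds its relevant page at rank 2 and the second never retrieves it, so the Top-5 value is 0.5 and the Top-1 value is 0.0.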