Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

Martin Asenov; Kenza Benkirane; Dan Goldwater; Aneiss Ghodsi

TL;DR

This work shows that BM25 can close large gaps on multilingual and visual benchmarks once documents are better transcribed and preprocessed. It calls for decomposed evaluation benchmarks that measure transcription and retrieval capabilities separately, so that the field can attribute progress correctly and focus effort where it matters.

Abstract

Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval systems relied on lexical methods such as BM25, which rank documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches, especially for multilingual documents with complex visual layouts. We show that better document representation is the primary driver of these benchmark improvements. By systematically varying transcription and preprocessing methods while holding the retrieval mechanism fixed, we find that BM25 can close large gaps on multilingual and visual benchmarks. Our findings call for decomposed evaluation benchmarks that measure transcription and retrieval capabilities separately, so that the field can attribute progress correctly and focus effort where it matters.
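The experimental idea in the abstract (fix the retriever, vary only the document representation) can be illustrated with a minimal sketch. Everything below is an assumption for illustration rather than the authors' pipeline: the class `BM25`, the helper `top_k_accuracy`, the whitespace tokenizer, the common default parameters `k1=1.5` and `b=0.75`, the Lucene-style smoothed IDF, and the made-up three-document corpus with simulated OCR errors.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization. Language-specific
    # preprocessing (stemming, segmentation, ...) is not reproduced here.
    return text.lower().split()


class BM25:
    """Minimal Okapi BM25 index (hypothetical helper, not the paper's code)."""

    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tf = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tf]
        # Guard against an all-empty corpus to avoid dividing by zero.
        self.avgdl = (sum(self.lengths) / len(docs)) or 1.0
        # Document frequency: number of documents containing each term.
        df = Counter(term for tf in self.tf for term in tf)
        n = len(docs)
        # Smoothed IDF (always positive), the corpus-level weighting.
        self.idf = {t: math.log((n - f + 0.5) / (f + 0.5) + 1.0)
                    for t, f in df.items()}

    def score(self, query, i):
        total = 0.0
        for term in tokenize(query):
            f = self.tf[i].get(term, 0)
            if f == 0:
                continue
            # Term-frequency saturation with document-length normalization.
            norm = f + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k=5):
        scored = [(self.score(query, i), i) for i in range(len(self.tf))]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        # Only documents sharing at least one term with the query are returned.
        return [i for s, i in scored[:k] if s > 0]


def top_k_accuracy(index, queries, k=5):
    # Fraction of queries whose gold document appears among the top k results.
    hits = sum(1 for q, gold in queries if gold in index.top_k(q, k))
    return hits / len(queries)


# Made-up corpus: the same three pages under two transcriptions.
clean = ["invoice total due march",
         "quarterly revenue table by region",
         "safety manual for the drill press"]
noisy = ["inv0ice t0tal due rnarch",  # simulated OCR character errors
         "quarterly revenue table by region",
         "safety manual for the drill press"]
queries = [("invoice total", 0), ("revenue by region", 1), ("drill press safety", 2)]

acc_clean = top_k_accuracy(BM25(clean), queries, k=1)
acc_noisy = top_k_accuracy(BM25(noisy), queries, k=1)
print(acc_clean, acc_noisy)
```

The retriever and the queries are identical in both runs, so any difference between the two accuracies comes from the text representation alone. This is the kind of attribution a decomposed benchmark would make explicit.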

Paper Structure (23 sections, 4 figures, 4 tables)

This paper contains 23 sections, 4 figures, 4 tables.

Introduction
Related work and preliminaries
Experiments
Experimental setup
Results
Conclusion
Usage of LLM models disclosure
Multilingual Retrieval: OCR vs. Preprocessing
When OCR Dominates vs. When Preprocessing Dominates
OCR-dominated languages.
Preprocessing-dominated languages.
BM25 Sensitivity to Representation Choices
Multilingual Leaderboard Context
Retrieval Across Figure, Table, and Text Documents
Dense Retrievers Also Benefit from Better Transcription
...and 8 more sections

Figures (4)

Figure 1: Multilingual benchmark results across 15 languages. Average Top-5 retrieval accuracy for different methods. Methods are sorted by performance (highest to lowest). Lexical retrievers (BM25) are shown with diagonal hatching. BM25+OCR indexes text produced by state-of-the-art OCR models and language-specific preprocessing techniques. Release years are shown in parentheses.
Figure 2: Language-specific BM25 optimization. Top-5 retrieval accuracy for BM25 under different OCR/transcription pipelines (Adobe, EasyOCR, Ministral 3B, Mistral OCR 3) and language-specific text processing (lemmatization, stemming, morphological analysis, and segmentation).
Figure 3: OCR impact on text retrieval methods. (a) BM25 retrieval performance at different Top-K values. (b) Retrieval performance for different combinations of OCR transcription models and text retrieval models.
Figure 4: Results on the figure-heavy QA benchmark. Average Top-5 retrieval accuracy for different methods. Methods are sorted by performance (highest to lowest). Lexical retrievers (BM25) are shown with diagonal hatching. BM25+OCR uses a small vision-language model (Ministral) for transcription. Release years are shown in parentheses.