BEIR-NL: Zero-shot Information Retrieval Benchmark for the Dutch Language
Nikolay Banar, Ehsan Lotfi, Walter Daelemans
TL;DR
BEIR-NL addresses the lack of Dutch zero-shot IR benchmarks by automatically translating the publicly accessible BEIR datasets into Dutch and evaluating the lexical BM25 method alongside a range of multilingual dense ranking and reranking models. The study finds that BM25 remains a competitive baseline: only the larger dense models trained for retrieval outperform it, and BM25 combined with a reranker performs on par with the best dense ranking models. Back-translating a selection of datasets to English lowers performance for both dense and lexical methods, which points to the limits of translation as a way to build benchmarks and to the value of native Dutch IR resources. BEIR-NL is publicly available on the Hugging Face hub.
Abstract
Zero-shot evaluation of information retrieval (IR) models is often performed using BEIR, a large and heterogeneous benchmark composed of multiple datasets that cover different retrieval tasks across various domains. Although BEIR has become a standard benchmark for the zero-shot setup, its exclusively English content reduces its utility for languages that are underrepresented in IR, including Dutch. To address this limitation and encourage the development of Dutch IR models, we introduce BEIR-NL by automatically translating the publicly accessible BEIR datasets into Dutch. Using BEIR-NL, we evaluated a wide range of multilingual dense ranking and reranking models, as well as the lexical BM25 method. Our experiments show that BM25 remains a competitive baseline and is outperformed only by the larger dense models trained for retrieval. When combined with reranking models, BM25 achieves performance on par with the best dense ranking models. In addition, we explored the impact of translation on the data by back-translating a selection of datasets to English, and observed a performance drop for both dense and lexical methods, indicating the limitations of translation for creating benchmarks. BEIR-NL is publicly available on the Hugging Face hub.
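For readers unfamiliar with the lexical baseline mentioned above, the following is a minimal sketch of Okapi BM25 ranking in plain Python. It is not the implementation used in the paper: the function names, the whitespace tokenizer, the parameter values k1 = 1.2 and b = 0.75, and the three-sentence Dutch toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive lowercase whitespace tokenizer (an assumption; real systems
    # typically use a language-aware analyzer with stemming and stopwords).
    return text.lower().split()


def bm25_rank(query, docs, k1=1.2, b=0.75):
    """Score every document against the query with Okapi BM25.

    Returns (ranking, scores): document indices sorted by descending
    score, and the raw score of each document in corpus order.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    query_terms = tokenize(query)
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)

    ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
    return ranking, scores


# Hypothetical toy corpus, not taken from BEIR-NL.
docs = [
    "de kat zit op de mat",
    "amsterdam is de hoofdstad van nederland",
    "de hond rent in het park",
]
ranking, scores = bm25_rank("hoofdstad nederland", docs)
print(ranking[0], scores)
```

In a retrieve-then-rerank setup of the kind the abstract describes, a ranking like this would supply the top candidates, which a separate reranking model would then rescore; that second stage is not shown here.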
