BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language

Konrad Wojtasik, Vadim Shishkin, Kacper Wołowiec, Arkadiusz Janz, Maciej Piasecki

TL;DR

BEIR-PL presents a Polish zero-shot IR benchmark by translating the BEIR datasets into Polish, enabling evaluation of cross-lingual information retrieval for a historically under-resourced language. The authors establish a baseline suite including BM25, unsupervised dense bi-encoders, and multiple rerankers (HerBERT-based and Polish T5) as well as ColBERT for late-interaction ranking, and they compare against multilingual baselines like LaBSE and mMiniLM. Results show that Polish BM25 underperforms due to morphology, while neural rerankers substantially improve retrieval quality across many datasets; performance is highly dataset-dependent, underscoring the need for per-dataset analysis. BEIR-PL is integrated into the MTEB Benchmark and provides open pre-trained Polish IR models, marking a significant step for Polish NLP and cross-lingual IR research with practical implications for multilingual search systems and evaluation. The dataset and findings offer a foundation for further Polish IR advancements and cross-lingual studies in zero-shot scenarios.

Abstract

The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR) in zero-shot settings that has garnered considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to the English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing research in this area of NLP. In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish and introduced BEIR-PL, a new benchmark comprising 13 datasets that facilitates further development, training and evaluation of modern Polish language models for IR tasks. We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. The evaluation revealed that BM25 achieved significantly lower scores for Polish than for English, which can be attributed to the high inflection and intricate morphological structure of the Polish language. Finally, we trained various re-ranking models to enhance BM25 retrieval and compared their performance to identify their characteristic features. To ensure accurate model comparisons, it is necessary to scrutinise individual results rather than to average across the entire benchmark. Thus, we thoroughly analysed the outcomes of IR models on each individual data subset of the BEIR benchmark. The benchmark data is available at https://huggingface.co/clarin-knext.

Paper Structure (21 sections, 2 figures, 6 tables)

Figures (2)

  • Figure 1: In the retrieval-with-re-ranking setting, the first stage retrieves the top-k most relevant documents with a fast but less accurate model (BM25 in our case). A more powerful and more accurate model then re-ranks these documents.
  • Figure 2: BM25 performance on MS MARCO passage retrieval in different languages (DBLP:journals/corr/abs-2108-13897).
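
The two-stage setting described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a whitespace tokeniser, common default BM25 parameters (`k1 = 1.5`, `b = 0.75`), and a hypothetical stand-in re-ranker, `trigram_overlap`, which is only a toy replacement for the neural re-rankers used in the paper.

```python
import math
from collections import Counter
from typing import Callable, List


def bm25_scores(query: List[str], docs: List[List[str]],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score every tokenised document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # exact match only: "kot" does not match "koty"
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def trigram_overlap(query: str, doc: str) -> float:
    """Toy re-ranker (assumption): Jaccard overlap of character trigrams."""
    def grams(text: str) -> set:
        text = text.lower()
        return {text[i:i + 3] for i in range(len(text) - 2)}
    q, d = grams(query), grams(doc)
    return len(q & d) / len(q | d) if q | d else 0.0


def retrieve_then_rerank(query: str, docs: List[str],
                         rerank_fn: Callable[[str, str], float],
                         k: int = 2) -> List[int]:
    """Stage 1: BM25 picks the top-k documents. Stage 2: rerank_fn reorders them."""
    tokenised = [d.lower().split() for d in docs]
    first_stage = bm25_scores(query.lower().split(), tokenised)
    top_k = sorted(range(len(docs)), key=lambda i: first_stage[i], reverse=True)[:k]
    return sorted(top_k, key=lambda i: rerank_fn(query, docs[i]), reverse=True)


docs = ["kot śpi na kanapie", "pies biega w parku", "koty lubią mleko"]
ranking = retrieve_then_rerank("kot", docs, trigram_overlap, k=2)
```

In this toy corpus the query "kot" receives a BM25 score of zero for the document containing the inflected form "koty", so the first stage never passes that document to the re-ranker. This is a small instance of the effect the abstract attributes to Polish inflection, and it shows why the quality of the second stage is bounded by what the first stage retrieves.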