PIRB: A Comprehensive Benchmark of Polish Dense and Hybrid Text Retrieval Methods

Sławomir Dadas, Michał Perełkiewicz, Rafał Poświata

TL;DR

PIRB introduces a comprehensive Polish Information Retrieval Benchmark consisting of 41 tasks, including 10 new datasets, to evaluate dense, sparse, and hybrid retrieval methods. It presents a three-stage recipe: knowledge distillation from English teacher models, supervised fine-tuning on Polish data, and a lightweight learning-to-rank rescoring step that fuses dense and sparse signals. The resulting dense models outperform previously available solutions, and hybrids add further gains. The work also trains new Polish encoders and publicly releases the benchmark, experiment code, and model checkpoints, aiming to standardize Polish IR evaluation and accelerate progress. Overall, PIRB contributes rigorous, language-specific resources and demonstrates effective transfer and fusion strategies for Polish.

Abstract

We present Polish Information Retrieval Benchmark (PIRB), a comprehensive evaluation framework encompassing 41 text information retrieval tasks for Polish. The benchmark incorporates existing datasets as well as 10 new, previously unpublished datasets covering diverse topics such as medicine, law, business, physics, and linguistics. We conduct an extensive evaluation of over 20 dense and sparse retrieval models, including the baseline models trained by us as well as other available Polish and multilingual methods. Finally, we introduce a three-step process for training highly effective language-specific retrievers, consisting of knowledge distillation, supervised fine-tuning, and building sparse-dense hybrid retrievers using a lightweight rescoring model. In order to validate our approach, we train new text encoders for Polish and compare their results with previously evaluated methods. Our dense models outperform the best solutions available to date, and the use of hybrid methods further improves their performance.
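The final step of the process, combining sparse and dense retrievers, can be illustrated with a minimal sketch. The abstract does not specify the rescoring model, so the code below substitutes a much simpler technique: min-max normalization of each retriever's scores followed by a fixed weighted sum. The function names, the `dense_weight` parameter, and the example scores are all hypothetical; a learned rescoring model would replace the fixed weight.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so that scales are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tie; give them no advantage over missing candidates.
        return {doc_id: 0.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    dense: Dict[str, float],
    sparse: Dict[str, float],
    dense_weight: float = 0.5,  # assumption: fixed weight, not a learned model
) -> List[Tuple[str, float]]:
    """Rank the union of both candidate lists by a weighted sum of normalized scores."""
    dense_n = min_max_normalize(dense)
    sparse_n = min_max_normalize(sparse)
    fused = {
        doc_id: dense_weight * dense_n.get(doc_id, 0.0)
        + (1.0 - dense_weight) * sparse_n.get(doc_id, 0.0)
        for doc_id in set(dense_n) | set(sparse_n)
    }
    # Sort by descending score; break ties by document id for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical scores for one query: cosine similarities and BM25-style scores.
dense_scores = {"d1": 1.0, "d2": 0.5, "d3": 0.0}
sparse_scores = {"d2": 12.0, "d4": 8.0, "d3": 4.0}
ranking = fuse_scores(dense_scores, sparse_scores)
```

A document that only one retriever returns contributes zero for the other, so candidates that both retrievers rank highly tend to rise to the top.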

Paper Structure

This paper contains 20 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: An overview of our procedure for building effective retrieval methods. In the first step, we perform knowledge transfer from an English dense retriever using a bilingual corpus. Next, we fine-tune the obtained model on an annotated dataset for text retrieval in the target language using contrastive loss. In the final step, we construct a lightweight hybrid combining dense and sparse methods, utilizing an additional learning-to-rank model.
  • Figure 2: NDCG@10 values obtained by the models trained in our study. We present the results of the original multilingual E5 models, as well as models distilled from FlagEmbeddings based on E5 and Polish RoBERTa. For each model, we show its performance before fine-tuning, after fine-tuning on the Polish MS MARCO dataset, and the result of the sparse-dense hybrid combining the given model with BM25 and SPLADE methods.
  • Figure 3: MRR@10 values obtained by the models trained in our study. We present the results of the original multilingual E5 models, as well as models distilled from FlagEmbeddings based on E5 and Polish RoBERTa. For each model, we show its performance before fine-tuning, after fine-tuning on the Polish MS MARCO dataset, and the result of the sparse-dense hybrid combining the given model with BM25 and SPLADE methods.
  • Figure 4: Recall@100 values obtained by the models trained in our study. We present the results of the original multilingual E5 models, as well as models distilled from FlagEmbeddings based on E5 and Polish RoBERTa. For each model, we show its performance before fine-tuning, after fine-tuning on the Polish MS MARCO dataset, and the result of the sparse-dense hybrid combining the given model with BM25 and SPLADE methods.
  • Figure 5: Accuracy@1 values obtained by the models trained in our study. We present the results of the original multilingual E5 models, as well as models distilled from FlagEmbeddings based on E5 and Polish RoBERTa. For each model, we show its performance before fine-tuning, after fine-tuning on the Polish MS MARCO dataset, and the result of the sparse-dense hybrid combining the given model with BM25 and SPLADE methods.
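The four metrics named in Figures 2 to 5 can be sketched for a single query as follows. This is an illustration under stated assumptions rather than the benchmark's own evaluation code: NDCG uses linear gain with a log2 rank discount (one common convention), any relevance label above zero counts as relevant, and the function names are hypothetical. Benchmark-level scores are typically obtained by averaging these per-query values.

```python
import math
from typing import Dict, List


def ndcg_at_k(ranked: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """NDCG@k with linear gain (assumption) and a log2 rank discount."""
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked[:k])
    )
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(gain / math.log2(rank + 2) for rank, gain in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank_at_k(ranked: List[str], relevance: Dict[str, int], k: int = 10) -> float:
    """Per-query term of MRR@k: 1 / rank of the first relevant document in the top k."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if relevance.get(doc_id, 0) > 0:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked: List[str], relevance: Dict[str, int], k: int = 100) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    relevant = {doc_id for doc_id, label in relevance.items() if label > 0}
    if not relevant:
        return 0.0
    return len(relevant & set(ranked[:k])) / len(relevant)


def accuracy_at_1(ranked: List[str], relevance: Dict[str, int]) -> float:
    """1.0 if the top-ranked document is relevant, otherwise 0.0."""
    return 1.0 if ranked and relevance.get(ranked[0], 0) > 0 else 0.0


# Hypothetical query: two relevant documents, retrieved at ranks 2 and 4.
ranked_docs = ["d3", "d1", "d7", "d2"]
labels = {"d1": 1, "d2": 1}
metrics = {
    "NDCG@10": ndcg_at_k(ranked_docs, labels),
    "RR@10": reciprocal_rank_at_k(ranked_docs, labels),
    "Recall@100": recall_at_k(ranked_docs, labels),
    "Accuracy@1": accuracy_at_1(ranked_docs, labels),
}
```

NDCG@10 and MRR@10 reward placing relevant documents near the top, Recall@100 measures coverage of a longer candidate list, and Accuracy@1 checks only the first result.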