A Comparative Study of Text Retrieval Models on DaReCzech

Jakub Stetina, Martin Fajcik, Michal Stefanik, Michal Hradis

TL;DR

This study conducts a comprehensive evaluation of seven off-the-shelf text retrieval models on the Czech DaReCzech dataset to determine effective approaches for Czech information retrieval. It compares retrieval in Czech directly versus translating to English and benchmarks index size, speed, and memory footprint across models. Gemma2 achieves the highest precision and recall, albeit with a large embedding index, while SPLADE offers memory efficiency and PLAID variants offer a middle ground; Contriever performs relatively poorly. The findings provide practical guidance for deploying Czech IR systems, illustrating trade-offs between accuracy, storage, and latency and informing model selection based on resource constraints.

Abstract

This article presents a comprehensive evaluation of seven off-the-shelf document retrieval models (SPLADE, PLAID, PLAID-X, SimCSE, Contriever, OpenAI ADA, and Gemma2) on the Czech retrieval dataset DaReCzech. The primary objective of our experiments is to estimate the quality of modern retrieval approaches in the Czech language. Our analyses cover retrieval quality, speed, and memory footprint. We also analyze whether it is better to apply a model directly to Czech text or to machine-translate the text into English and then retrieve in English. Our experiments identify the most effective option for Czech information retrieval. The findings reveal notable performance differences among the models: Gemma2 achieves the highest precision and recall, while Contriever performs poorly. SPLADE and PLAID models offer a balance of efficiency and performance.

Paper Structure

This paper contains 25 sections, 7 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Comparison of Precision and Recall at different values of $k$.
  • Figure 2: Comparison of MRR, NDCG at different values of $k$.
  • Figure 3: Doc size, query latency in relation to P@5.
  • Figure 4: Pairwise overlap and correlation of overlapped items in top-100 responses of different IR systems.
  • Figure 5: BM25 hyperparameters grid search.
  • ...and 1 more figures
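
The figure captions above refer to Precision, Recall, MRR, and NDCG at a cutoff $k$. As a reading aid, here is a minimal sketch of how these metrics are commonly computed for a single query. It is not taken from the paper: the function names, the toy document IDs, and the linear-gain form of NDCG are assumptions made for illustration, and the paper's own definitions (for example, how DaReCzech relevance labels are mapped to gains) may differ.

```python
import math

# Illustrative helpers only; names and gain formula are assumptions,
# not the paper's implementation.

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for d in ranked[:k] if d in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents that appear in the top-k."""
    if not relevant:
        return 0.0
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant document in the top-k, else 0.
    MRR is the mean of this value over all queries."""
    for rank, d in enumerate(ranked[:k], start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, gains, k):
    """NDCG with linear gain: DCG of the ranking divided by the DCG
    of an ideal ordering. `gains` maps doc id -> relevance grade."""
    dcg = sum(gains.get(d, 0.0) / math.log2(rank + 1)
              for rank, d in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1)
               for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Toy query: a system returns four documents, two of which are relevant.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
gains = {"d1": 1.0, "d2": 1.0}

print(precision_at_k(ranked, relevant, 2))
print(recall_at_k(ranked, relevant, 2))
print(reciprocal_rank_at_k(ranked, relevant, 4))
print(ndcg_at_k(ranked, gains, 4))
```

In this toy case the first relevant document sits at rank 2, so precision, recall, and reciprocal rank at that cutoff all equal 0.5. Corpus-level scores are obtained by averaging the per-query values over all queries.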