CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents

Francisco Valentini, Diego Kozlowski, Vincent Larivière

TL;DR

This work introduces CLIRudit, the first English–French academic CLIR dataset built from Érudit, enabling large-scale evaluation of first-stage CLIR methods with English keywords as queries and French abstracts as documents. It benchmarks a wide spectrum of dense and sparse retrievers, with and without machine translation, revealing that dense bi-encoders can nearly match MT-enabled performance even without translation, and that document translation provides substantial gains, especially for sparse methods. The study offers practical guidance for deploying cross-lingual scholarly search systems, balancing retrieval effectiveness with translation and indexing costs, and provides a reproducible benchmarking pipeline to extend cross-lingual access to non-English scholarly content across language pairs.

Abstract

Cross-lingual information retrieval (CLIR) helps users find documents in languages different from their queries. This is especially important in academic search, where key research is often published in non-English languages. We present CLIRudit, a novel English-French academic retrieval dataset built from Érudit, a Canadian publishing platform. Using multilingual metadata, we pair English author-written keywords as queries with non-English abstracts as target documents, a method that can be applied to other languages and repositories. We benchmark various first-stage sparse and dense retrievers, with and without machine translation. We find that dense embeddings without translation perform nearly as well as systems using machine translation, that translating documents is generally more effective than translating queries, and that sparse retrievers with document translation remain competitive while offering greater efficiency. Along with releasing the first English-French academic retrieval dataset, we provide a reproducible benchmarking method to improve access to non-English scholarly content.
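
The keyword-to-abstract pairing described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the field names (`title_fr`, `subtitle_fr`, `abstract_fr`, `keywords_en`), the function name `build_clir_dataset`, and the lowercase normalization of keywords are all assumptions made for illustration. The only parts taken from the text are that English keywords act as queries, that an article is relevant to a query when it lists that keyword, and that a document is the French title, subtitle, and abstract.

```python
from collections import defaultdict


def build_clir_dataset(articles):
    """Build documents and relevance judgments from bilingual article metadata.

    `articles` is a list of dicts with hypothetical field names; a real
    repository export will use its own schema.
    """
    documents = {}
    qrels = defaultdict(set)  # query text -> ids of relevant articles
    for art in articles:
        # Document text: French title, subtitle, and abstract (missing parts skipped).
        parts = [art.get("title_fr"), art.get("subtitle_fr"), art.get("abstract_fr")]
        documents[art["id"]] = " ".join(p for p in parts if p)
        for keyword in art.get("keywords_en", []):
            # Assumption: keywords are matched after trimming and lowercasing.
            query = keyword.strip().lower()
            if query:
                qrels[query].add(art["id"])
    return documents, dict(qrels)


# Toy metadata, invented for this example only.
articles = [
    {"id": "a1", "title_fr": "Titre un", "subtitle_fr": None,
     "abstract_fr": "Résumé un.", "keywords_en": ["Migration", "labour market"]},
    {"id": "a2", "title_fr": "Titre deux", "subtitle_fr": "Sous-titre",
     "abstract_fr": "Résumé deux.", "keywords_en": ["migration"]},
]
documents, qrels = build_clir_dataset(articles)
```

Because the same keyword can appear in several articles, a single query may have more than one relevant document, which is why the judgments are stored as sets.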

Paper Structure

This paper contains 16 sections, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: The CLIRudit dataset. We use articles with abstracts and keywords in both French and English. English keywords form the queries, with relevance judged by their presence in each article. Documents consist of the French title, subtitle, and abstract.
  • Figure 2: Number of queries per discipline in the CLIRudit dataset. A query inherits the disciplines of the articles containing its keywords. Since queries can originate from multiple articles and articles can have multiple disciplines, percentages do not sum to 100%.
  • Figure 3: Percentage difference in MAP and Recall@100 for document translation (red) and query translation (blue), relative to no translation. Positive values indicate improvement with translation; negative values indicate degradation. For ease of visualization, sparse models are shown on a different scale and only GPT translation is considered.
  • Figure 4: Percentage difference in MAP (green) and Recall@100 (purple) for the best-performing approach of each retriever, relative to gold-standard translations.
  • Figure 5: MAP of retrievers across CLIRudit disciplines. Each dot represents a method's MAP on a discipline's queries, using its best translation method (excluding gold). Dot colors indicate retrievers: Croissant (pink) often performs worst, while the best retriever varies by discipline.
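
The figures above report MAP, Recall@100, and percentage differences against a baseline. As a rough sketch of how such numbers are commonly computed, the functions below use the standard textbook definitions. The function names, the input layout (`run` as query to ranked ids, `qrels` as query to relevant ids), and the relative-change formula are assumptions; the paper's exact evaluation settings, such as ranking depth, are not given here.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranking, normalized by the number of relevant docs."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)


def recall_at_k(ranked_ids, relevant_ids, k=100):
    """Fraction of relevant docs that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)


def mean_over_queries(metric, run, qrels, **kwargs):
    """Average a per-query metric over all judged queries.

    Queries missing from `run` are scored against an empty ranking.
    """
    scores = [metric(run.get(q, []), rel, **kwargs) for q, rel in qrels.items()]
    return sum(scores) / len(scores)


def pct_difference(score, baseline):
    """Relative change of `score` with respect to `baseline`, in percent."""
    return 100.0 * (score - baseline) / baseline


# Toy rankings and judgments, invented for this example only.
run = {"q1": ["d1", "d2", "d3"], "q2": ["d9"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d4"}}
map_score = mean_over_queries(average_precision, run, qrels)
recall_100 = mean_over_queries(recall_at_k, run, qrels, k=100)
```

Under these definitions, a positive `pct_difference` means the evaluated configuration scores above the baseline, matching the sign convention in the Figure 3 caption.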