CLIRudit: Cross-Lingual Information Retrieval of Scientific Documents
Francisco Valentini, Diego Kozlowski, Vincent Larivière
TL;DR
This work introduces CLIRudit, the first English–French academic CLIR dataset, built from Érudit. It enables large-scale evaluation of first-stage CLIR methods, with English keywords as queries and French abstracts as documents. It benchmarks a wide range of dense and sparse retrievers, with and without machine translation (MT). Dense bi-encoders applied without translation nearly match the performance of MT-based systems, and document translation provides substantial gains, especially for sparse methods. The study offers practical guidance for deploying cross-lingual scholarly search systems, weighing retrieval effectiveness against translation and indexing costs, and provides a reproducible benchmarking pipeline for extending cross-lingual access to non-English scholarly content in other language pairs.
Abstract
Cross-lingual information retrieval (CLIR) helps users find documents in languages different from their queries. This is especially important in academic search, where key research is often published in non-English languages. We present CLIRudit, a novel English-French academic retrieval dataset built from Érudit, a Canadian publishing platform. Using multilingual metadata, we pair English author-written keywords as queries with non-English abstracts as target documents, a method that can be applied to other languages and repositories. We benchmark various first-stage sparse and dense retrievers, with and without machine translation. We find that dense embeddings without translation perform nearly as well as systems using machine translation, that translating documents is generally more effective than translating queries, and that sparse retrievers with document translation remain competitive while offering greater efficiency. Along with releasing the first English-French academic retrieval dataset, we provide a reproducible benchmarking method to improve access to non-English scholarly content.
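The keyword-to-abstract pairing described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record fields (`id`, `abstract_fr`, `keywords_en`), the lowercase and whitespace normalization, and the rule that a document is relevant to every keyword its authors assigned are all assumptions made for illustration.

```python
from collections import defaultdict


def build_clir_dataset(records):
    """Turn metadata records into a corpus and relevance judgments.

    Hypothetical record schema (not taken from the paper):
      id           -- document identifier
      abstract_fr  -- French abstract, used as the target document
      keywords_en  -- English author-written keywords, used as queries
    """
    corpus = {}                # doc_id -> French abstract
    qrels = defaultdict(set)   # English query -> set of relevant doc_ids
    for rec in records:
        abstract = (rec.get("abstract_fr") or "").strip()
        if not abstract:
            continue  # a record without an abstract cannot be a target document
        corpus[rec["id"]] = abstract
        for keyword in rec.get("keywords_en", []):
            # Assumed normalization: lowercase and collapse repeated whitespace
            query = " ".join(keyword.lower().split())
            if query:
                qrels[query].add(rec["id"])
    return corpus, dict(qrels)


# Toy records, invented for this example
records = [
    {"id": "d1",
     "abstract_fr": "Cet article étudie la recherche d'information multilingue.",
     "keywords_en": ["Information retrieval", "multilingual search"]},
    {"id": "d2",
     "abstract_fr": "Nous évaluons des modèles de plongement pour la recherche savante.",
     "keywords_en": ["information  retrieval", "embeddings"]},
    {"id": "d3", "abstract_fr": "", "keywords_en": ["embeddings"]},
]

corpus, qrels = build_clir_dataset(records)
```

In this sketch a keyword shared by several articles becomes a single query with several relevant documents, and records lacking an abstract are dropped. Because nothing in it is specific to French, the same pairing would apply to any repository that exposes English keywords alongside abstracts in another language.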
