LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval
Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann
TL;DR
LEMUR introduces a large-scale multilingual EU environmental-law corpus derived from EUR-Lex PDFs and a Lexical Content Score to quantify PDF-to-text fidelity, addressing retrieval noise in multilingual legal settings. The authors fine-tune three state-of-the-art multilingual embedding models using monolingual and bilingual contrastive objectives, and evaluate retrieval under monolingual, bilingual, and cross-lingual scenarios. Results show consistent retrieval improvements across languages and model sizes, with especially strong gains for low-resource languages, and evidence that fine-tuning yields language-independent, content-level representations transferable to unseen languages. The work provides data, code, and a practical retrieval pipeline (VectorDB) to advance robust multilingual legal retrieval, while outlining limitations in topical coverage, bilingual testing, and residual PDF extraction noise.
Abstract
Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code (GitHub repository: https://github.com/nargesbh/eur_lex) and data (Hugging Face dataset: https://huggingface.co/datasets/G4KMU/LEMUR).
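For readers unfamiliar with the Top-k retrieval accuracy metric mentioned in the abstract, the following is a minimal sketch of how such a score is commonly computed: a query counts as a hit if its gold document appears among the k highest-ranked documents. The cosine-similarity ranking, the function names (`cosine`, `top_k_accuracy`), and the toy vectors are illustrative assumptions, not the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_accuracy(query_vecs, doc_vecs, gold, k):
    """Fraction of queries whose gold document index is among the k most similar documents.

    query_vecs: list of query embeddings
    doc_vecs:   list of document embeddings
    gold:       gold[i] is the index in doc_vecs of the relevant document for query i
    """
    if not query_vecs:
        raise ValueError("at least one query is required")
    hits = 0
    for q, g in zip(query_vecs, gold):
        scores = [cosine(q, d) for d in doc_vecs]
        # Rank document indices by descending similarity to the query.
        ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
        if g in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Toy data (hypothetical): three 2-d "document embeddings" and two queries.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.6, 0.5]]
gold = [0, 1]

acc_at_1 = top_k_accuracy(queries, docs, gold, k=1)
acc_at_3 = top_k_accuracy(queries, docs, gold, k=3)
```

In this toy setup the first query retrieves its gold document at rank 1, while the second only finds it at rank 3, so accuracy rises as k grows. In practice the embeddings would come from the embedding model under evaluation and the ranking from a vector index.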
