LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval

Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann

TL;DR

LEMUR introduces a large-scale multilingual EU environmental-law corpus derived from EUR-Lex PDFs and a Lexical Content Score to quantify PDF-to-text fidelity, addressing retrieval noise in multilingual legal settings. The authors fine-tune three state-of-the-art multilingual embeddings using monolingual and bilingual contrastive objectives, and evaluate retrieval under monolingual, bilingual, and cross-lingual scenarios. Results show consistent retrieval improvements across languages and model sizes, with especially strong gains for low-resource languages, and evidence that fine-tuning yields language-independent, content-level representations transferable to unseen languages. The work provides data, code, and a practical retrieval pipeline (VectorDB) to advance robust multilingual legal retrieval, while outlining limitations in topical coverage, bilingual testing, and residual PDF extraction noise.

Abstract

Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code (GitHub repository: https://github.com/nargesbh/eur_lex) and data (Hugging Face dataset: https://huggingface.co/datasets/G4KMU/LEMUR).

Paper Structure

This paper contains 37 sections, 4 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Average Content Score similarity per year (5-year bins) for the five languages used in our experiments.
  • Figure 2: Number of documents per country in LEMUR.
  • Figure 3: End-to-end pipeline for data preparation, contrastive fine-tuning, and retrieval. EUR-Lex PDFs are processed into structured JSONL, split into queries (metadata) and documents (legislative text), and used to fine-tune embedding models. The resulting embeddings are indexed for Top-$k$ retrieval of legislative acts.
  • Figure 4: Monolingual fine-tuning of three embedding models (E5, Qwen-0.6B, and Qwen-4B) on five languages (EN, DE, FR, LV, MT). Performance is measured using Acc@k for k = 1, 3, and 5 on test queries evaluated against the test document collection. Results are shown as stacked bars comparing the base model with its fine-tuned variant.
  • Figure 5: Cross-lingual fine-tuning for Qwen3 0.6B on five languages (EN, DE, FR, LV, MT). Performance is measured using Acc@k for k = 1, 3, and 5. Results are shown as stacked bars comparing the base model with its fine-tuned variant.
  • ...and 6 more figures
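
The captions for Figures 3 to 5 refer to Top-k retrieval over embeddings and to Acc@k. As a rough illustration only, the sketch below ranks documents by cosine similarity and computes Acc@k as the fraction of queries whose gold document appears among the top k results. This is the common definition of the metric and is assumed here; the paper's exact evaluation protocol may differ. The function names (`cosine`, `top_k`, `acc_at_k`) and the toy two-dimensional vectors are hypothetical and do not come from the LEMUR code or data.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k):
    """Indices of the k documents most similar to the query, best first."""
    scores = [cosine(query_vec, d) for d in doc_vecs]
    ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def acc_at_k(query_vecs, doc_vecs, gold, k):
    """Fraction of queries whose gold document index is in the top k."""
    hits = sum(1 for q, g in zip(query_vecs, gold) if g in top_k(q, doc_vecs, k))
    return hits / len(query_vecs)


# Toy embeddings (made up for illustration): three documents, three queries.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
queries = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.1]]
gold = [0, 1, 2]  # gold[i] is the index of the relevant document for query i

for k in (1, 3):
    print(f"Acc@{k} = {acc_at_k(queries, docs, gold, k):.3f}")
```

In this toy setup the third query ranks its gold document second, so it counts as a miss at k = 1 and a hit at k = 3. In a real pipeline the vectors would come from an embedding model and the ranking from a vector index rather than a full sort.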