One Law, Many Languages: Benchmarking Multilingual Legal Reasoning for Judicial Support

Ronja Stern, Vishvaksenan Rasiah, Veton Matoshi, Srinanda Brügger Bose, Matthias Stürmer, Ilias Chalkidis, Daniel E. Ho, Joel Niklaus

TL;DR

SMILED introduces seven public multilingual Swiss legal datasets to stress-test long-context processing, domain knowledge, multilingual understanding, multitasking, and legal reasoning. The authors pretrained in-domain models and evaluated various generation, classification, and retrieval tasks, finding that domain-specific, multilingual models still lag behind but show meaningful gains over baselines. Key contributions include public datasets, two large in-domain pretraining corpora, and three Legal-Swiss PLMs, with a CC BY-SA license for all resources. The work highlights significant room for improvement in court-view generation and information retrieval, underscoring the need for retrieval-augmented generation and larger multilingual models to advance practical judicial support tools. The benchmark is designed to drive progress with real-world Swiss legal data while emphasizing multilinguality and long documents, aiming to improve accessibility and efficiency in judicial workflows.

Abstract

Recent strides in Large Language Models (LLMs) have saturated many Natural Language Processing (NLP) benchmarks, emphasizing the need for more challenging ones to properly assess LLM capabilities. However, domain-specific and multilingual benchmarks are rare because they require in-depth expertise to develop. Still, most public models are trained predominantly on English corpora, while other languages remain understudied, particularly for practical domain-specific NLP tasks. In this work, we introduce a novel NLP benchmark for the legal domain that challenges LLMs in five key dimensions: processing long documents (up to 50K tokens), using domain-specific knowledge (embodied in legal texts), multilingual understanding (covering five languages), multitasking (comprising legal document-to-document Information Retrieval, Court View Generation, Leading Decision Summarization, Citation Extraction, and eight challenging Text Classification tasks) and reasoning (comprising especially Court View Generation, but also the Text Classification tasks). Our benchmark contains diverse datasets from the Swiss legal system, allowing for a comprehensive study of the underlying non-English, inherently multilingual legal system. Despite the large size of our datasets (some with hundreds of thousands of examples), existing publicly available multilingual models struggle with most tasks, even after extensive in-domain pre-training and fine-tuning. We publish all resources (benchmark suite, pre-trained models, code) under permissive open CC BY-SA licenses.

Paper Structure (98 sections, 21 figures, 26 tables)

Figures (21)

  • Figure 1: How we envision this benchmark supporting the judicial system end-to-end: 1) we route incoming complaints to the correct chamber (LAP), 2) we prioritize easier cases for preliminary automated processing (CP), 3) IR and CE help judges and clerks find and cite relevant legislation and case law, 4) we help the judiciary draft decisions (CVG), 5) we predict and verify judgments based on the history of the court (JP), 6) we predict which decisions are likely to have a big impact on future jurisprudence (CP), and 7) we summarize the leading decisions (LDS). The LLMs performing all these tasks are trained on and retrieve from the pretraining corpus containing most publicly available Swiss legal data.
  • Figure 2: Database Creation Pipeline
  • Figure 3: Rulings text length distribution
  • Figure 4: Legislation text length distribution
  • Figure 5: Language distribution of rulings texts
  • ...and 16 more figures