LegalBench.PT: A Benchmark for Portuguese Law
Beatriz Canaverde, Telmo Pessoa Pires, Leonor Melo Ribeiro, André F. T. Martins
TL;DR
LegalBench.PT introduces the first benchmark tailored to Portuguese law, combining real law exams with synthetic conversion into multiple-choice, true/false, and matching formats. The authors build a Portuguese law taxonomy, collect 341 university exams, and generate 4,723 questions across 31 fields using GPT-4o, followed by rigorous filtering and validation of a sample by a lawyer. Among the models evaluated, GPT-4o and Claude-3.5-Sonnet achieve the strongest performance, and bias analyses that replicate the data with alternative generators find no substantial bias. Portuguese lawyers provide a human baseline that often aligns with smaller models, underscoring ambiguities and noise in the dataset. The work demonstrates the feasibility and limitations of synthetic benchmark construction for specialized legal domains and outlines directions for expansion, refinement, and more extensive human evaluation.
Abstract
The recent application of LLMs to the legal field has spurred the creation of benchmarks across various jurisdictions and languages. However, no benchmark has yet been specifically designed for the Portuguese legal system. In this work, we present LegalBench.PT, the first comprehensive legal benchmark covering key areas of Portuguese law. To develop LegalBench.PT, we first collect long-form questions and answers from real law exams, and then use GPT-4o to convert them into multiple-choice, true/false, and matching formats. Once generated, the questions are filtered and processed to improve the quality of the dataset. To ensure accuracy and relevance, we validate our approach by having a legal professional review a sample of the generated questions. Although the questions are synthetically generated, we show that their basis in human-created exams, together with the rigorous filtering and processing we apply, results in a reliable benchmark for assessing LLMs' legal knowledge and reasoning abilities. Finally, we evaluate the performance of leading LLMs on LegalBench.PT and investigate potential biases in GPT-4o's responses. We also assess the performance of Portuguese lawyers on a sample of questions to establish a baseline for model comparison and validate the benchmark.
