GreekBarBench: A Challenging Benchmark for Free-Text Legal Reasoning and Citations
Odysseas S. Chlapanis, Dimitrios Galanis, Nikolaos Aletras, Ion Androutsopoulos
TL;DR
GreekBarBench fills a gap in benchmarks for open-ended, citation-rich legal reasoning by drawing its questions from the Greek Bar exams across five legal areas. It introduces a three-dimensional scoring framework (Facts, Cited Articles, Analysis) and an LLM-as-a-judge approach with span-based rubrics, plus a meta-evaluation benchmark (GBB-JME) that measures how well LLM judges align with human experts. In experiments across 13 LLMs, the best models exceed average expert performance but do not reach the top 5% of experts, and performance depends strongly on the availability of chapter-level context. The authors publicly release the benchmark and the judge meta-evaluation data and demonstrate the robustness of the evaluation framework. They highlight context quality and article retrieval as critical factors for reliable legal AI systems, and they discuss ethical considerations and practical limitations such as computational cost and the absence of retrieval benchmarking.
Abstract
We introduce GreekBarBench, a benchmark that evaluates LLMs on legal questions across five different legal areas from the Greek Bar exams, requiring citations to statutory articles and case facts. To tackle the challenges of free-text evaluation, we propose a three-dimensional scoring system combined with an LLM-as-a-judge approach. We also develop a meta-evaluation benchmark to assess the correlation between LLM-judges and human expert evaluations, revealing that simple, span-based rubrics improve the judges' alignment with the experts. Our systematic evaluation of 13 proprietary and open-weight LLMs shows that even though the best models outperform average expert scores, they fall short of the 95th percentile of experts.
