LeMAJ (Legal LLM-as-a-Judge): Bridging Legal Reasoning and LLM Evaluation

Joseph Enguehard, Morgane Van Ermengem, Kate Atkinson, Sujeong Cha, Arijit Ghosh Chowdhury, Prashanth Kallur Ramaswamy, Jeremy Roghair, Hannah R Marlowe, Carina Suzana Negreanu, Kitty Boxall, Diana Mincu

TL;DR

This work introduces LeMAJ, a reference-free framework for evaluating legal QA by decomposing model answers into Legal Data Points (LDPs) and assessing each for Correctness and Relevance, guided by a human-annotated interface. It demonstrates that granular LDP-based evaluation better aligns with expert judgments than traditional reference-based metrics or prior LLM-as-a-Judge methods, across proprietary and LegalBench datasets. The approach improves inter-annotator agreement and shows practical value through a commercial triage use case that reduces human review time. By open-sourcing the LDPs and providing a scalable evaluation workflow, the authors offer a robust, replicable pathway for rigorously assessing legal QA systems in real-world settings.

Abstract

Evaluating large language model (LLM) outputs in the legal domain presents unique challenges due to the complex and nuanced nature of legal analysis. Current evaluation approaches either depend on reference data, which is costly to produce, or use standardized assessment methods, both of which have significant limitations for legal applications. Although LLM-as-a-Judge has emerged as a promising evaluation technique, its reliability and effectiveness in legal contexts depend heavily on evaluation processes unique to the legal industry and how trustworthy the evaluation appears to the human legal expert. This is where existing evaluation methods currently fail and exhibit considerable variability. This paper aims to close the gap: a) we break down lengthy responses into 'Legal Data Points' (LDPs), self-contained units of information, and introduce a novel, reference-free evaluation methodology that reflects how lawyers evaluate legal answers; b) we demonstrate that our method outperforms a variety of baselines on both our proprietary dataset and an open-source dataset (LegalBench); c) we show how our method correlates more closely with human expert evaluations and helps improve inter-annotator agreement; and finally d) we open source our Legal Data Points for a subset of LegalBench used in our experiments, allowing the research community to replicate our results and advance research in this vital area of LLM evaluation on legal question-answering.

Paper Structure

This paper contains 29 sections, 2 figures, 21 tables.

Figures (2)

  • Figure 1: Based on a legal document, a question and an answer, our LeMAJ framework performs an automated evaluation by segmenting the answer into Legal Data Points (LDPs) and evaluating each one. A domain expert might also use this framework to manually evaluate each LDP and produce their own scores.
  • Figure 2: An example of LDPs showing both the LLM evaluation performed by LeMAJ and the evaluation by a human legal expert, which together yield the LeMAJ Alignment score.
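
The per-LDP comparison described in the two figure captions can be illustrated with a short sketch. This is not the authors' implementation: the `LDPJudgment` schema, the `alignment` function, and the definition used here (the fraction of LDPs on which the LLM judge and the human expert give the same Correctness and Relevance labels) are assumptions made for illustration, and the example LDPs are invented. The paper's own definition of the LeMAJ Alignment score may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LDPJudgment:
    """One Legal Data Point with one judge's labels (hypothetical schema)."""
    text: str
    correct: bool   # Correctness label
    relevant: bool  # Relevance label


def alignment(llm: list[LDPJudgment], human: list[LDPJudgment]) -> float:
    """Share of LDPs where both labels match (assumed definition, not the paper's)."""
    if len(llm) != len(human):
        raise ValueError("both judges must label the same LDPs")
    if not llm:
        return 0.0
    agree = sum(
        a.correct == b.correct and a.relevant == b.relevant
        for a, b in zip(llm, human)
    )
    return agree / len(llm)


# Invented example: the same three LDPs labelled by an LLM judge and by an expert.
llm_labels = [
    LDPJudgment("The agreement is governed by New York law.", True, True),
    LDPJudgment("Either party may terminate on 30 days' notice.", True, True),
    LDPJudgment("The agreement was signed in 2019.", True, False),
]
human_labels = [
    LDPJudgment("The agreement is governed by New York law.", True, True),
    LDPJudgment("Either party may terminate on 30 days' notice.", False, True),
    LDPJudgment("The agreement was signed in 2019.", True, False),
]

score = alignment(llm_labels, human_labels)
print(f"alignment = {score:.2f}")
```

Scoring at the LDP level rather than per answer is what lets a disagreement be traced to one specific statement, here the termination clause.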