Evaluating Legal Reasoning Traces with Legal Issue Tree Rubrics

Jinu Lee, Kyoung-Woon On, Simeng Han, Arman Cohan, Julia Hockenmaier

TL;DR

LEGIT introduces a large-scale Korean legal judgment dataset with legal issue trees and rubric-based evaluative signals for reasoning traces, enabling reliable LLM-as-a-judge assessments of issue coverage and correctness. The dataset supports a backward-chaining evaluation of legal reasoning and demonstrates strong inter-rater reliability with human experts and substantial agreement from capable LLM evaluators. Findings show that LLMs struggle with decomposing and correctly reasoning about legal issues, and that retrieval-augmented generation and reinforcement learning with LEGIT rubrics offer complementary improvements by broadening coverage and improving correctness, respectively. The work highlights rubric-based evaluation as a key step toward achieving expert-level reasoning in high-stakes domains and points to practical avenues for improving LLM legal reasoning.

Abstract

Evaluating the quality of LLM-generated reasoning traces in expert domains (e.g., law) is essential for ensuring credibility and explainability, yet remains challenging due to the inherent complexity of such reasoning tasks. We introduce LEGIT (LEGal Issue Trees), a novel large-scale (24K instances) expert-level legal reasoning dataset with an emphasis on reasoning trace evaluation. We convert court judgments into hierarchical trees of opposing parties' arguments and the court's conclusions, which serve as rubrics for evaluating the issue coverage and correctness of the reasoning traces. We verify the reliability of these rubrics via human expert annotations and comparison with coarse, less informative rubrics. Using the LEGIT dataset, we show that (1) LLMs' legal reasoning ability is seriously affected by both legal issue coverage and correctness, and that (2) retrieval-augmented generation (RAG) and RL with rubrics bring complementary benefits for legal reasoning abilities, where RAG improves overall reasoning capability, whereas RL improves correctness albeit with reduced coverage.

Paper Structure

This paper contains 64 sections, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Overview of the LEGIT dataset and task. Facts and issue trees are extracted from real-world court judgments to serve as inputs and rubrics for the LEGIT task. See Appendix \ref{sec:appendix-example} for another example.
  • Figure 2: Lawyer-LLM inter-rater agreement in LEGIT score evaluation. Lawyers achieve strong agreement with one another, indicating that the generated rubrics are sound and effective. While strong LLMs (Gemini, GPT) achieve significant agreement with human experts, weaker open-source LLMs exhibit limited agreement.
  • Figure 3: Confusion matrices of individual issue labels for (lawyer vs. lawyer) and (lawyer vs. Gemini-2.0-Flash). LLM evaluators tend to overestimate coverage and correctness compared to experts. For Krippendorff's $\alpha$, we apply an ordinal scale where the three labels correspond to 0, 2, and 5 (= 2 + 3), in order, following the score scheme.
  • Figure 4: Comparison of LLM-evaluated scores under the LEGIT score and a Likert scale. Even though the Likert scale prompt includes the ground truth court judgments and rubrics, its coarse granularity limits the inter-rater agreement of LLM-as-a-judge compared to the modular LEGIT rubrics.
  • Figure 5: LEGIT score of 12 generator LLMs, evaluated with Gemini-2.0-Flash.
  • ...and 14 more figures
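
The per-issue score scheme mentioned in the Figure 3 caption (three labels worth 0, 2, and 5 = 2 + 3 points) can be illustrated with a small sketch. Only the point values come from the caption. The label names, the `IssueNode` structure, and the normalized-sum aggregation in `tree_score` are assumptions made for illustration, and the paper's actual aggregation over the issue tree may differ.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# Point values follow the Figure 3 caption (0, 2, 5 = 2 + 3).
# The label names are hypothetical: we read 2 as credit for covering an
# issue and the extra 3 as credit for resolving it correctly.
LABEL_POINTS = {
    "not_covered": 0,
    "covered_incorrect": 2,
    "covered_correct": 5,
}
MAX_POINTS = 5


@dataclass
class IssueNode:
    """One node of a (hypothetical) legal issue tree with a judge-assigned label."""
    issue: str
    label: str = "not_covered"
    children: List["IssueNode"] = field(default_factory=list)


def iter_nodes(node: IssueNode) -> Iterator[IssueNode]:
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def tree_score(root: IssueNode) -> float:
    """Assumed aggregation: earned points divided by the maximum attainable."""
    nodes = list(iter_nodes(root))
    earned = sum(LABEL_POINTS[n.label] for n in nodes)
    return earned / (MAX_POINTS * len(nodes))


# Toy example with invented issues, not taken from the dataset.
root = IssueNode(
    "Is the contract valid?",
    "covered_correct",
    [
        IssueNode("Was there a valid offer?", "covered_incorrect"),
        IssueNode("Was consent given freely?", "not_covered"),
    ],
)
score = tree_score(root)  # (5 + 2 + 0) / 15
```

Scoring each issue separately is what makes the rubric modular: a judge labels one issue at a time rather than assigning a single coarse grade to the whole reasoning trace.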