Thinking Longer, Not Always Smarter: Evaluating LLM Capabilities in Hierarchical Legal Reasoning

Li Zhang, Matthias Grabmair, Morgan Gray, Kevin Ashley

TL;DR

This paper introduces a formal, three-task framework for evaluating LLMs on hierarchical, case-based legal reasoning. It uses a factor-based case representation and a CATO-style knowledge hierarchy to identify significant distinctions between a current case and a precedent. The framework defines Task 1 (identify distinctions), Task 2 (analyze argumentative roles via the hierarchy), and Task 3 (identify significant distinctions), and pairs them with a symbolic ground-truth solver and an evaluation pipeline. Empirically, accuracy stays high on surface-level reasoning (Task 1) but degrades on hierarchical reasoning (Task 2) and integrated analysis (Task 3). Thinking-enabled models improve performance at the cost of more reasoning tokens, yet models spend more tokens on incorrect responses than on correct ones, revealing a disconnect between computational effort and correctness. The results underscore fundamental limitations of current reasoning LLMs for legal analysis and offer a rigorous framework to diagnose and guide the development of more robust, trustworthy legal AI systems.

Abstract

Case-based reasoning is a cornerstone of U.S. legal practice, requiring professionals to argue about a current case by drawing analogies to and distinguishing from past precedents. While Large Language Models (LLMs) have shown remarkable capabilities, their proficiency in this complex, nuanced form of reasoning needs further investigation. We propose a formal framework that decomposes the process of identifying significant distinctions between cases into three staged reasoning tasks. Our framework models cases using factual predicates called factors, organizes them into a legal knowledge hierarchy, and defines verifiable rules for identifying distinctions, analyzing their argumentative support, and evaluating their significance. Through comprehensive evaluation of modern reasoning LLMs, we reveal a paradox: while models achieve high accuracy on surface-level reasoning (Task 1), performance degrades on hierarchical reasoning (Task 2: 64.82%-92.09%) and collapses on integrated analysis (Task 3: 11.46%-33.99%). Most strikingly, we find that models consistently expend more computational resources on incorrect responses than on correct ones, suggesting that "thinking longer" does not always mean "thinking smarter." Our work provides a methodology for fine-grained analysis of LLM reasoning capabilities in complex domains and reveals fundamental limitations that must be addressed for robust and trustworthy legal AI.
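
To make the factor-based representation concrete, the following is a minimal Python sketch of what Task 1 (identify distinctions) could look like. It is an illustration under stated assumptions, not the paper's solver: the abstract does not spell out the exact rules, so the sketch assumes a common CATO-style reading in which a distinction is either a factor that helped the precedent's winner and is missing from the current case, or a factor in the current case that favors the other side and is missing from the precedent. The names Case, find_distinctions, FACTOR_SIDE, and the factor labels A, B, C, E are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Set

PLAINTIFF, DEFENDANT = "P", "D"


@dataclass(frozen=True)
class Case:
    """A case reduced to the set of factors that apply to it (assumed model)."""
    name: str
    factors: FrozenSet[str]
    winner: Optional[str] = None  # None for an undecided current case


def find_distinctions(current: Case, precedent: Case,
                      factor_side: Dict[str, str]) -> Set[str]:
    """Return unshared factors that weaken the analogy to the precedent.

    Assumed rule (CATO-style reading, not taken from the paper):
      - factors only in the precedent that favor the precedent's winner, and
      - factors only in the current case that favor the opposing side.
    """
    if precedent.winner not in (PLAINTIFF, DEFENDANT):
        raise ValueError("precedent must have a decided winner")
    loser = DEFENDANT if precedent.winner == PLAINTIFF else PLAINTIFF

    # Strengths the precedent's winner had that the current case lacks.
    missing_strengths = {
        f for f in precedent.factors - current.factors
        if factor_side[f] == precedent.winner
    }
    # Weaknesses present in the current case that the precedent did not face.
    extra_weaknesses = {
        f for f in current.factors - precedent.factors
        if factor_side[f] == loser
    }
    return missing_strengths | extra_weaknesses


# Hypothetical factors and the side each one favors.
FACTOR_SIDE = {"A": PLAINTIFF, "B": PLAINTIFF, "C": DEFENDANT, "E": DEFENDANT}

precedent = Case("precedent", frozenset({"A", "B", "C"}), winner=PLAINTIFF)
current = Case("current", frozenset({"A", "C", "E"}))

distinctions = find_distinctions(current, precedent, FACTOR_SIDE)
print(sorted(distinctions))
```

In this toy scenario the shared factors A and C are ignored, while B (a plaintiff strength only in the precedent) and E (a defendant factor only in the current case) are reported. Tasks 2 and 3 would additionally require the knowledge hierarchy, which this sketch deliberately leaves out.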

Paper Structure

This paper contains 57 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: The decomposed framework for identifying significant distinctions, which consists of three steps: (1) identify distinctions, (2) analyze the argumentative roles of a distinction via the legal knowledge hierarchy, and (3) identify significant distinctions. Red and blue indicate the side each factor favors.
  • Figure 2: Modifying the difficulty level of the tasks by increasing the reasoning complexity.
  • Figure 3: The evaluation pipeline, which consists of scenario generation, ground truth creation, LLM inference, and evaluation.
  • Figure 4: Model performance across Tasks 1–3. The left panel illustrates accuracy, showing a decline as tasks become more complex. The right panel displays the average number of thinking tokens used, which increases with task difficulty.
  • Figure 5: Token usage patterns reveal inefficient reasoning strategies. The figure shows the difference in thinking tokens between incorrect and correct responses across Tasks 2–3, highlighting how models often expend more computational effort on answers they ultimately get wrong.
  • ...and 2 more figures