Investigating the Shortcomings of LLMs in Step-by-Step Legal Reasoning

Venkatesh Mishra, Bimsara Pathiraja, Mihir Parmar, Sat Chidananda, Jayanth Srinivasa, Gaowen Liu, Ali Payani, Chitta Baral

TL;DR

The paper tackles the reliability gap in step-by-step legal reasoning by introducing a fine-grained error taxonomy for Premise- and Conclusion-level reasoning and an LLM-driven auto-evaluator to quantify reasoning quality on the Civ. Pro. MCQA dataset. It demonstrates that LLMs often produce sound premises but still fail to reach correct, fully justified conclusions, with misinterpretations and false-premise propagation as dominant error modes. The authors show that incorporating taxonomy-based feedback into prompting strategies yields modest improvements (up to ~4%), and they provide an evaluation framework that can scale to other complex, logic-heavy domains. The work contributes a practical, automated approach to dissect and mitigate reasoning errors in legal AI, with potential to improve reliability in high-stakes applications and to extend to other domains requiring rigorous step-by-step justification.

Abstract

Reasoning abilities of LLMs have been a key focus in recent years. One challenging reasoning domain with interesting nuances is legal reasoning, which requires careful application of rules and precedents while balancing deductive and analogical reasoning and resolving conflicts between rules. Although there have been a few works on using LLMs for legal reasoning, their focus has been on overall accuracy. In this paper, we dig deeper with a step-by-step analysis to identify where LLMs commit errors. We use the college-level Multiple Choice Question-Answering (MCQA) task from the Civil Procedure dataset and propose a new error taxonomy derived from an initial manual analysis of reasoning chains from several LLMs, including two objective measures: soundness and correctness scores. We then develop an LLM-based automated evaluation framework to identify reasoning errors and evaluate the performance of LLMs. The computation of soundness and correctness on the dataset using the auto-evaluator framework reveals several interesting insights. Furthermore, we show that incorporating the error taxonomy as feedback in popular prompting techniques marginally increases LLM performance. Our work will also serve as an evaluation framework that can be used in detailed error analysis of reasoning chains for logic-intensive complex tasks.

Paper Structure

This paper contains 45 sections, 2 equations, 20 figures, 18 tables.

Figures (20)

  • Figure 1: An example of determining domicile in a legal context. A reasoner must discern whether the condition of 'indefinite to stay in a place' is met. While many LLMs predict Marla is domiciled in Montana since her program is only for 2 years, legally, her ambiguous plans indicate an intent to remain in Denver indefinitely, making her domiciled in Denver, not Montana.
  • Figure 2: Overview of the proposed pipeline for evaluating legal reasoning in LLMs. The process begins with converting the Civ. Pro. dataset (top left), followed by generating reasoning chains using LLMs in a zero-shot CoT setting (bottom left). These chains are manually analyzed for various error types (top right), based on the proposed error taxonomy. The pipeline is then automated using an LLM-based system (bottom right) to assess reasoning chains for errors such as misinterpretation, providing insights into the LLMs' reasoning accuracy.
  • Figure 3: The overall schematic representation of the LLM-based error-detection and evaluation system and the calculation of the metrics. The reasoning chains are produced by 5 LLMs, and the expert answer is referenced from the Civ. Pro. dataset.
  • Figure 4: Performance of 5 LLMs in terms of Accuracy vs. Correctness on the Civ. Pro. dataset. Here, Mistral stands for Mistral-7B-v2-Instruct, Llama stands for Llama-3-8B-Instruct, GPT-3.5t and GPT-4t stand for GPT-3.5-turbo and GPT-4-turbo respectively.
  • Figure 5: The percentage distribution of the premise-level error categories across the reasoning chains of all 5 LLMs. The total number of steps generated by each model is provided inside the round brackets below the model names. Here 'NE' denotes Correct Premise (No errors), 'M' denotes Premise containing a Misinterpretation, 'FH' denotes Factual Hallucination in the premise, 'IP' denotes an Irrelevant Premise.
  • ...and 15 more figures
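
The percentage distribution described in the Figure 5 caption can be illustrated with a minimal sketch. It assumes each reasoning step has already been assigned exactly one of the four premise-level labels from the caption (NE, M, FH, IP). The function name `premise_error_distribution` and the sample labels are hypothetical and are not taken from the paper; the paper's own evaluator and its soundness and correctness scores are not reproduced here.

```python
from collections import Counter

# Premise-level labels as named in the Figure 5 caption.
PREMISE_LABELS = ("NE", "M", "FH", "IP")


def premise_error_distribution(step_labels):
    """Return the percentage of steps carrying each premise-level label.

    Hypothetical helper: assumes one label per reasoning step.
    """
    unknown = set(step_labels) - set(PREMISE_LABELS)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    total = len(step_labels)
    if total == 0:
        # No steps to count, so every category is reported as 0%.
        return {label: 0.0 for label in PREMISE_LABELS}
    counts = Counter(step_labels)
    return {label: 100.0 * counts[label] / total for label in PREMISE_LABELS}


# Illustrative labels for one model's reasoning steps (made-up data).
example_steps = ["NE", "NE", "M", "NE", "FH", "NE", "IP", "NE"]
distribution = premise_error_distribution(example_steps)
print(distribution)
```

Reporting percentages rather than raw counts keeps models comparable even though, as the caption notes, each model generates a different total number of steps.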