Benchmarking Multi-Step Legal Reasoning and Analyzing Chain-of-Thought Effects in Large Language Models

Wenhan Yu, Xinbo Lin, Lanxin Ni, Jinhua Cheng, Lei Sha

TL;DR

This paper introduces MSLR, the first Chinese multi-step legal reasoning benchmark aligned with the IRAC framework, built from 1,389 insider trading decisions and annotated with 60K step-level IRAC-Traces. It presents a scalable PPP Human-LLM annotation pipeline to generate high-quality reasoning traces and introduces the IRAC Recall and LLM-as-a-Judge metrics to evaluate reasoning quality. Empirical results show current LLMs achieve only moderate performance on legal multi-step reasoning; Human-Designed CoT yields inconsistent gains while Self-Initiated CoT consistently improves reasoning quality and coherence. The work provides open data and code and highlights the importance of model-task aligned prompting for robust legal reasoning.

Abstract

Large language models (LLMs) have demonstrated strong reasoning abilities across specialized domains, motivating research into their application to legal reasoning. However, existing legal benchmarks often conflate factual recall with genuine inference, fragment the reasoning process, and overlook the quality of reasoning. To address these limitations, we introduce MSLR, the first Chinese multi-step legal reasoning dataset grounded in real-world judicial decision-making. MSLR adopts the IRAC framework (Issue, Rule, Application, Conclusion) to model structured expert reasoning from official legal documents. In addition, we design a scalable Human-LLM collaborative annotation pipeline that efficiently produces fine-grained step-level reasoning annotations and provides a reusable methodological framework for multi-step reasoning datasets. Evaluation of multiple LLMs on MSLR shows only moderate performance, highlighting the challenges of adapting to complex legal reasoning. Further experiments demonstrate that Self-Initiated Chain-of-Thought prompts, which models generate autonomously, improve reasoning coherence and quality and outperform human-designed prompts. MSLR contributes to advancing LLM reasoning and Chain-of-Thought strategies and offers open resources for future research. The dataset and code are available at https://github.com/yuwenhan07/MSLR-Bench and https://law.sjtu.edu.cn/flszyjzx/index.html.

Paper Structure

This paper contains 32 sections, 6 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: A schematic of the MSLR construction and evaluation framework.
  • Figure 2: Pearson correlation analysis and linear fit between IRAC Recall and LLM Score.
  • Figure 3: Zero-Shot vs. One-Shot performance on the legal reasoning task.
  • Figure 4: H-CoT vs. S-CoT on the legal reasoning task.
  • Figure 5: Typology and coverage of Self-Initiated CoT reasoning traces in three mainstream LLMs.
  • ...and 2 more figures