Multi-Source Retrieval and Reasoning for Legal Sentencing Prediction

Junjie Chen, Haitao Li, Qilei Zhang, Zhenghua Li, Ya Zhang, Quan Zhou, Cheng Luo, Yiqun Liu, Dongsheng Guo, Qingyao Ai

TL;DR

This work tackles Legal Sentencing Prediction (LSP) by introducing $MSR^2$, a reinforcement learning–driven framework that performs routable multi-source retrieval and employs a process-level reward to supervise intermediate reasoning. By integrating sources such as statutes, judicial interpretations, and sentencing guidelines directly into the reasoning process, the model achieves more accurate and interpretable sentencing decisions than prior semantic- or logic-focused approaches. Training with Group Relative Policy Optimization (GRPO) further stabilizes learning by using group-based baselines and masking retrieved tokens during optimization. Experiments on CAIL2018 and CJO22 show state-of-the-art prediction accuracy, and qualitative case studies illustrate transparent, source-grounded reasoning paths that make the predictions easier to interpret.
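The two GRPO ingredients mentioned above, group-based baselines and masking of retrieved tokens, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the use of the population standard deviation, the `eps` constant, and the simplified surrogate (no clipping, no KL term) are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to the other rollouts sampled for the same case.

    The group mean acts as the baseline, so no separate value network is needed.
    Using the population standard deviation is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def masked_token_mean(values, is_retrieved):
    """Average per-token values over tokens the policy generated itself.

    Tokens copied in from a retriever are skipped, so they contribute
    nothing to the objective.
    """
    kept = [v for v, retrieved in zip(values, is_retrieved) if not retrieved]
    return sum(kept) / len(kept) if kept else 0.0


def rollout_surrogate_loss(token_logprobs, is_retrieved, advantage):
    """Simplified policy-gradient surrogate for one rollout.

    Hypothetical helper: real GRPO also applies ratio clipping and a KL
    penalty, both omitted here to keep the masking step visible.
    """
    return -advantage * masked_token_mean(token_logprobs, is_retrieved)


# Four rollouts for one case; two received reward 1.0 and two received 0.0.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
# Second token of this rollout came from a retriever and is masked out.
loss = rollout_surrogate_loss([-0.2, -5.0, -0.4], [False, True, False], advantages[0])
```

Masking matters because retrieved passages are not produced by the policy; letting them enter the loss would push the model toward reproducing retriever output rather than improving its own reasoning and search decisions.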

Abstract

Legal judgment prediction (LJP) aims to predict judicial outcomes from case facts and typically includes law article, charge, and sentencing prediction. While recent methods perform well on the first two subtasks, legal sentencing prediction (LSP) remains difficult due to its need for fine-grained objective knowledge and flexible subjective reasoning. To address these limitations, we propose $MSR^2$, a framework that integrates multi-source retrieval and reasoning in LLMs with reinforcement learning. $MSR^2$ enables LLMs to perform multi-source retrieval based on reasoning needs and applies a process-level reward to guide intermediate subjective reasoning steps. Experiments on two real-world datasets show that $MSR^2$ improves both accuracy and interpretability in LSP, providing a promising step toward practical legal AI. Our code is available at https://anonymous.4open.science/r/MSR2-FC3B.

Paper Structure

This paper contains 24 sections, 6 equations, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: A concrete example of legal sentencing prediction. The case facts at the top are used as input. The bottom part presents the human judge’s reasoning and the final sentence. This example shows that legal sentencing prediction needs finer-grained objective knowledge (e.g., large / huge / especially huge thresholds) and flexible subjective reasoning (e.g., mitigation). Existing methods often miss these elements, and our method fills this gap.
  • Figure 2: Overview of the $MSR^2$ framework. (1) Multi-source retrieval: the policy LLM issues a <search> request with an optional target source such as <statute> or <guideline>. The query is then routed to the corresponding retriever, and the retrieved top-k evidence is injected into <information> to support further reasoning. (2) Process-level reward: the LLM enumerates both objective and subjective sentencing factors in <factors>. The quality of these factors is evaluated by the LLM-as-a-judge to yield a process reward for guidance. (3) RL optimization: the model is trained end-to-end with the GRPO algorithm.
  • Figure 3: Prompt used for scoring whether the proposed sentencing factors are supported by the case facts.
  • Figure 4: Case study of a fraud case. $MSR^2$ demonstrates a transparent decision process by selectively routing queries to diverse and appropriate sources, and performing sound subjective reasoning by integrating retrieved information with case facts. The model correctly anchors the baseline term by verifying the specific amount threshold and subsequently adjusts the sentence based on mitigation factors, forming a coherent chain from facts to judgment.
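
The routing step described in the Figure 2 caption can be sketched in Python as follows. This is only an illustration under stated assumptions: the exact tag layout (a source tag nested inside `<search>`), the fallback to a default source when no tag is given, and the names `route_search`, `retrievers`, `default_source`, and `top_k` are hypothetical and do not come from the paper.

```python
import re

# Assumed layout: <search><statute>query</statute></search>, or <search>query</search>.
SEARCH_RE = re.compile(r"<search>(.*?)</search>", re.DOTALL)
SOURCE_RE = re.compile(r"^\s*<(\w+)>(.*?)</\1>\s*$", re.DOTALL)


def route_search(generation, retrievers, default_source="statute", top_k=3):
    """Send the first <search> request to the retriever for its target source.

    `retrievers` maps a source name to a callable returning a ranked list of
    passages. Returns an <information> block, or None if no search was issued.
    """
    match = SEARCH_RE.search(generation)
    if match is None:
        return None
    inner = match.group(1)
    tagged = SOURCE_RE.match(inner)
    if tagged and tagged.group(1) in retrievers:
        source, query = tagged.group(1), tagged.group(2).strip()
    else:
        # No recognized source tag: fall back to a default source (an assumption).
        source, query = default_source, inner.strip()
    passages = retrievers[source](query)[:top_k]
    return "<information>" + "\n".join(passages) + "</information>"


# Toy retrievers standing in for real statute and guideline indexes.
toy_retrievers = {
    "statute": lambda q: [f"statute hit for: {q}"],
    "guideline": lambda q: [f"guideline hit for: {q}"],
}
block = route_search(
    "I need the amount threshold. <search><guideline>fraud amount</guideline></search>",
    toy_retrievers,
)
```

In a full rollout loop, the returned `<information>` block would be appended to the model's context before generation resumes, and its tokens would be the ones excluded from the training loss.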