Beyond Guilt: Legal Judgment Prediction with Trichotomous Reasoning
Kepu Zhang, Haoyue Yang, Xu Tang, Weijie Yu, Jun Xu
TL;DR
The paper addresses a gap in legal judgment prediction: existing models cannot return an innocent verdict. It introduces LJPIV, a benchmark that incorporates innocent verdicts and supports trichotomous reasoning over the elements of the offense, unlawfulness, and culpability. A three-stage data augmentation pipeline (sentence extraction, injection of grounds for justification via retrieval-augmented generation, and quality verification) produces not-guilty labeled data across three datasets. Two approaches to trichotomous reasoning follow: a prompt-based method and a LoRA fine-tuning approach. Experiments show that current legal LLMs struggle with innocent verdicts, while the proposed strategies improve both in-domain and cross-domain performance, especially for innocent verdicts. The work provides a dataset and methodological framework for legally informed AI in civil-law contexts and highlights the need for explicit innocence reasoning in deployed legal NLP systems.
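The data flow of a three-stage augmentation pipeline of this kind can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Case` type, the `augment` function, and the three callables (`extract`, `inject`, `verify`) are hypothetical names, and the summary does not specify what each stage does internally, so they are passed in as plain functions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class Case:
    facts: str
    charge: Optional[str]  # None marks an innocent (not-guilty) label


def augment(
    cases: Iterable[Case],
    extract: Callable[[str], Optional[str]],   # stage 1: pick a sentence (hypothetical)
    inject: Callable[[str, str], str],         # stage 2: add a ground for justification (hypothetical)
    verify: Callable[[Case, Case], bool],      # stage 3: quality check (hypothetical)
) -> List[Case]:
    """Turn guilty cases into not-guilty candidates and keep the verified ones."""
    augmented: List[Case] = []
    for case in cases:
        sentence = extract(case.facts)
        if sentence is None:
            continue  # nothing usable in this case, skip it
        candidate = Case(facts=inject(case.facts, sentence), charge=None)
        if verify(case, candidate):
            augmented.append(candidate)
    return augmented
```

In practice the `inject` step would call a retrieval-augmented generator and `verify` would combine automatic and manual checks; here they are left abstract so the control flow stays visible.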
Abstract
In legal practice, judges apply the trichotomous dogmatics of criminal law, sequentially assessing the elements of the offense, unlawfulness, and culpability to determine whether an individual's conduct constitutes a crime. Although current legal large language models (LLMs) show promising accuracy in judgment prediction, they lack trichotomous reasoning capabilities due to the absence of an appropriate benchmark dataset, preventing them from predicting innocent outcomes. As a result, every input is automatically assigned a charge, limiting their practical utility in legal contexts. To bridge this gap, we introduce LJPIV, the first benchmark dataset for Legal Judgment Prediction with Innocent Verdicts. Adhering to the trichotomous dogmatics, we extend three widely used legal datasets through LLM-based augmentation and manual verification. Our experiments with state-of-the-art legal LLMs and novel strategies that integrate trichotomous reasoning into zero-shot prompting and fine-tuning reveal: (1) current legal LLMs have significant room for improvement, with even the best models achieving an F1 score of less than 0.3 on LJPIV; and (2) our strategies notably enhance both in-domain and cross-domain judgment prediction accuracy, especially for cases resulting in an innocent verdict.
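The sequential assessment described above can be illustrated with a short sketch. It assumes the three stage outcomes are already available as booleans (in a real system an LLM or a judge would produce them); the names `Assessment` and `trichotomous_verdict` are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Assessment:
    elements_satisfied: bool  # stage 1: conduct matches the elements of the offense
    unlawful: bool            # stage 2: no ground for justification applies
    culpable: bool            # stage 3: the actor can be held responsible


def trichotomous_verdict(a: Assessment, charge: str) -> Tuple[str, Optional[str]]:
    """Return (label, failed_stage); the first failed stage yields 'innocent'."""
    stages = (
        ("elements of the offense", a.elements_satisfied),
        ("unlawfulness", a.unlawful),
        ("culpability", a.culpable),
    )
    for name, passed in stages:
        if not passed:
            return "innocent", name  # later stages are not examined
    return charge, None
```

For example, conduct that satisfies the elements of an offense but is justified (self-defense is a common illustration) fails at the second stage, so `trichotomous_verdict(Assessment(True, False, True), "assault")` yields an innocent label with `"unlawfulness"` as the failed stage. A model that skips these checks can only ever return the charge.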
