CaseGen: A Benchmark for Multi-Stage Legal Case Documents Generation

Haitao Li, Jiaying Ye, Yiran Hu, Jia Chen, Qingyao Ai, Yueyue Wu, Junjie Chen, Yifan Chen, Cheng Luo, Quan Zhou, Yiqun Liu

TL;DR

CaseGen introduces the first comprehensive benchmark for multi-stage legal case document generation in the Chinese legal domain, built from 500 real cases and covering seven document sections. It defines four generation tasks that mirror real-world drafting stages and employs an automated LLM-as-a-judge evaluation framework validated by human annotations. Experimental results show that current LLMs struggle with the complexities of legal document generation, though open-source models can be competitive and the judge-based evaluation aligns reasonably with human judgments. The benchmark lays the groundwork for rigorous, scalable evaluation of AI-assisted legal drafting and highlights avenues for model improvement and domain-specific tooling.

Abstract

Legal case documents play a critical role in judicial proceedings. As the number of cases continues to rise, manual drafting of legal case documents is under increasing pressure. The development of large language models (LLMs) offers a promising solution for automating document generation. However, existing benchmarks fail to fully capture the complexities involved in drafting legal case documents in real-world scenarios. To address this gap, we introduce CaseGen, a benchmark for multi-stage legal case document generation in the Chinese legal domain. CaseGen is based on 500 real case samples annotated by legal experts and covers seven essential case sections. It supports four key tasks: drafting defense statements, writing trial facts, composing legal reasoning, and generating judgment results. To the best of our knowledge, CaseGen is the first benchmark designed to evaluate LLMs in the context of legal case document generation. To ensure an accurate and comprehensive evaluation, we design an LLM-as-a-judge evaluation framework and validate its effectiveness through human annotations. We evaluate several widely used general-domain LLMs and legal-specific LLMs, highlighting their limitations in case document generation and pinpointing areas for potential improvement. This work marks a step toward a more effective framework for automating legal case document drafting, paving the way for the reliable application of AI in the legal field. The dataset and code are publicly available at https://github.com/CSHaitao/CaseGen.

Paper Structure

This paper contains 33 sections, 2 figures, and 10 tables.

Figures (2)

  • Figure 1: A task example in CaseGen (translated from Chinese).
  • Figure 2: An overview of CaseGen. CaseGen includes four key generation tasks and uses LLM-as-a-judge as the primary evaluation method.