Ready Jurist One: Benchmarking Language Agents for Legal Intelligence in Dynamic Environments

Zheng Jia, Shengbin Yue, Wei Chen, Siyuan Wang, Yidong Liu, Yun Song, Zhongyu Wei

TL;DR

The paper identifies a gap between static legal benchmarks and the dynamic, procedural nature of real-world legal practice, and introduces J1-ENVS, an interactive legal environment, together with J1-EVAL, an accompanying evaluation framework. J1-ENVS builds six scenarios across three hierarchical levels, with role agents grounded in real Chinese legal data; J1-EVAL uses 508 instances and dual metrics (outcome-focused and process-focused) to assess both deliverables and procedural compliance. Benchmarking 17 LLM agents reveals solid legal knowledge but substantial difficulty with procedural execution in dynamic settings: even GPT-4o falls short of 60% overall performance. The results point to a need for stronger procedural and multi-agent coordination abilities, and the authors position J1-ENVS and J1-EVAL as foundations for future data generation and reinforcement learning toward dynamic legal intelligence across legal systems.

Abstract

The gap between static benchmarks and the dynamic nature of real-world legal practice poses a key barrier to advancing legal intelligence. To this end, we introduce J1-ENVS, the first interactive and dynamic legal environment tailored for LLM-based agents. Guided by legal experts, it comprises six representative scenarios from Chinese legal practices across three levels of environmental complexity. We further introduce J1-EVAL, a fine-grained evaluation framework, designed to assess both task performance and procedural compliance across varying levels of legal proficiency. Extensive experiments on 17 LLM agents reveal that, while many models demonstrate solid legal knowledge, they struggle with procedural execution in dynamic settings. Even the SOTA model, GPT-4o, falls short of 60% overall performance. These findings highlight persistent challenges in achieving dynamic legal intelligence and offer valuable insights to guide future research.

Paper Structure

This paper contains 57 sections, 1 equation, 54 figures, 15 tables, 3 algorithms.

Figures (54)

  • Figure 1: An illustration of the J1-ENVS construction pipeline. (A) Role Agent Setting: We synthesize real-world legal sources and personality theories to construct heterogeneous agents. (B) Multi-level Environment Construction: We structure these roles within specific procedures and relationships to form environments.
  • Figure 2: Distribution of legal attributes for six environments in J1-EVAL, showing a wide range of coverage.
  • Figure 3: Overall performance ranking across different LLM agent sizes.
  • Figure 4: Procedural-following performance of legal agents in civil and criminal courts. (a) Completion rate of court stages in Civil & Criminal Court. (b) Proportion of failed cases attributed to each cause.
  • Figure 5: Overall performance of legal agents with J1-ENVS driven by GPT-4o or Qwen3-Instruct-32B.
  • ...and 49 more figures