LegalAgentBench: Evaluating LLM Agents in Legal Domain

Haitao Li, Junjie Chen, Jingli Yang, Qingyao Ai, Wei Jia, Youfeng Liu, Kai Lin, Yueyue Wu, Guozhi Yuan, Yiran Hu, Wuyue Wang, Yiqun Liu, Minlie Huang

TL;DR

LegalAgentBench addresses the lack of standard benchmarks for evaluating LLM agents in the Chinese legal domain by introducing an environment with 17 corpora and 37 external tools, plus a scalable task-construction framework that yields 300 tasks. It combines multi-hop reasoning and writing tasks with fine-grained process metrics (a progress rate computed from intermediate keywords) to assess not only final outcomes but also solution trajectories. The study evaluates eight popular LLMs using three interaction methods, revealing differences in tool usage and reasoning capabilities and highlighting persistent gaps in domain-specific legal knowledge and reasoning. The work provides an MIT-licensed dataset and code, discusses ethical considerations, and outlines future directions for extending the benchmark to more languages and legal systems in support of responsible AI-enabled legal practice.

Abstract

With the increasing intelligence and autonomy of LLM agents, their potential applications in the legal domain are becoming increasingly apparent. However, existing general-domain benchmarks cannot fully capture the complexity and subtle nuances of real-world judicial cognition and decision-making. Therefore, we propose LegalAgentBench, a comprehensive benchmark specifically designed to evaluate LLM agents in the Chinese legal domain. LegalAgentBench includes 17 corpora from real-world legal scenarios and provides 37 tools for interacting with external knowledge. We designed a scalable task construction framework and carefully annotated 300 tasks. These tasks span various types, including multi-hop reasoning and writing, and range across different difficulty levels, effectively reflecting the complexity of real-world legal scenarios. Moreover, beyond evaluating final success, LegalAgentBench incorporates keyword analysis during intermediate processes to calculate progress rates, enabling more fine-grained evaluation. We evaluated eight popular LLMs, highlighting the strengths, limitations, and potential areas for improvement of existing models and methods. LegalAgentBench sets a new benchmark for the practical application of LLMs in the legal domain, with its code and data available at https://github.com/CSHaitao/LegalAgentBench.
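
To make the keyword-based progress rate concrete, here is a minimal sketch of one way such a metric could be computed. It assumes the progress rate is the fraction of annotated intermediate keywords that appear somewhere in the agent's trajectory; the function name, the substring-matching rule, and the sample strings are all hypothetical and are not taken from the LegalAgentBench code.

```python
from typing import Sequence


def progress_rate(trajectory: Sequence[str], keywords: Sequence[str]) -> float:
    """Return the share of intermediate keywords found in the trajectory.

    Assumption: a keyword counts as reached if it occurs as a substring
    of any step the agent produced. The benchmark's actual matching rule
    may differ.
    """
    if not keywords:
        # No annotated keywords: define the rate as 0.0 to avoid dividing by zero.
        return 0.0
    text = "\n".join(trajectory)
    hits = sum(1 for kw in keywords if kw in text)
    return hits / len(keywords)


# Hypothetical three-hop task: the agent reaches two of the three intermediate facts.
steps = [
    "Tool call: look up the company by name, result: Company A",
    "Tool call: look up the case by company, result: Case No. 123",
]
expected = ["Company A", "Case No. 123", "Court B"]
rate = progress_rate(steps, expected)
print(f"progress rate = {rate:.2f}")
```

A metric of this shape gives partial credit to a run that completes early hops of a multi-hop task but fails before the final answer, which a plain success rate would score as zero.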

Paper Structure

This paper contains 32 sections, 4 equations, 2 figures, 15 tables.

Figures (2)

  • Figure 1: A task example in LegalAgentBench (translated from Chinese).
  • Figure 2: The overview of the task construction process in LegalAgentBench.