MASLegalBench: Benchmarking Multi-Agent Systems in Deductive Legal Reasoning

Huihao Jing, Wenbin Hu, Hongyu Luo, Jianhui Yang, Wei Fan, Haoran Li, Yangqiu Song

TL;DR

MASLegalBench introduces a dedicated benchmark for evaluating multi-agent systems in deductive legal reasoning using GDPR cases. It combines a knowledge base with extended IRAC-based reasoning and four specialized role-based agents, enabling structured task decomposition and collaboration. Extensive experiments across multiple Meta-LLMs and retrieval settings show that adding specialized agents and richer context yields substantial performance gains and reveals inter-agent synergies, while also highlighting potential pitfalls such as reliance on certain roles. The work provides a new direction for robust, MAS-enabled legal reasoning and offers reproducible data and code.

Abstract

Multi-agent systems (MAS), leveraging the remarkable capabilities of Large Language Models (LLMs), show great potential in addressing complex tasks. In this context, integrating MAS with legal tasks is a crucial step. While previous studies have developed legal benchmarks for LLM agents, none are specifically designed to consider the unique advantages of MAS, such as task decomposition, agent specialization, and flexible training. In fact, the lack of evaluation methods limits the potential of MAS in the legal domain. To address this gap, we propose MASLegalBench, a legal benchmark tailored for MAS and designed with a deductive reasoning approach. Our benchmark uses GDPR as the application scenario, encompassing extensive background knowledge and covering complex reasoning processes that effectively reflect the intricacies of real-world legal situations. Furthermore, we manually design various role-based MAS and conduct extensive experiments using different state-of-the-art LLMs. Our results highlight the strengths, limitations, and potential areas for improvement of existing models and MAS architectures.

Paper Structure

This paper contains 36 sections, 6 figures, 10 tables, and 2 algorithms.

Figures (6)

  • Figure 1: An overview of the enhanced IRAC reasoning process. Here, we take Birthlink (a company) as an example. In this case, a single issue is decomposed into several smaller questions, which are assigned to different agents: identifying the relevant facts and rules, inferring their alignment, and supplementing with common sense, before passing the results to the Meta-LLM for the final conclusion.
  • Figure 2: Heatmap of Cohen’s Kappa agreement across individual knowledge types and models.
  • Figure 3: Heatmap of Cohen’s Kappa agreement across different configurations for DeepSeek-v3.1 under the BM25 setting.
  • Figure 4: IRAC elements distribution across 15 cases. Each bar represents a case and is colored according to IRAC elements.
  • Figure 5: Distribution of yes/no and multiple-choice (ABCD) questions across 15 cases. Each bar represents a case and is divided by question type.
  • ...and 1 more figure
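
Figures 2 and 3 report Cohen's Kappa agreement. For readers unfamiliar with the statistic, the following is a minimal sketch of how it can be computed for two sets of answers to the same questions. It assumes the agreement is measured between the answers produced by two configurations; the function name `cohens_kappa` and the example labels are hypothetical and are not taken from the paper's code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both label sequences must be non-empty and equally long")

    n = len(labels_a)

    # Observed agreement: fraction of questions on which both answers match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # answered independently according to its own label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in all_labels) / (n * n)

    # Degenerate case: both sequences use one identical label throughout.
    if expected == 1.0:
        return 1.0

    return (observed - expected) / (1 - expected)


# Made-up answers from two hypothetical configurations on six yes/no questions.
config_x = ["yes", "no", "yes", "yes", "no", "yes"]
config_y = ["yes", "no", "no", "yes", "no", "yes"]
kappa = cohens_kappa(config_x, config_y)
print(f"kappa = {kappa:.3f}")
```

A value of 1 means the two answer sets agree perfectly, 0 means agreement is no better than chance, and negative values mean agreement is worse than chance.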