LegalSim: Multi-Agent Simulation of Legal Systems for Discovering Procedural Exploits

Sanket Badhe

TL;DR

LegalSim introduces a modular multi-agent simulation to study how AI-driven strategies can exploit codified legal procedures. By encoding procedural rules in a JSON rules engine and resolving outcomes with a stochastic judge, the environment supports training four policy families and measuring exploit risk with a composite, regime-agnostic metric. Exploit chains such as cost inflation and calendar pressure emerge, with PPO achieving the strongest win rate and the contextual bandit remaining broadly competitive, which underscores the need for rule-system red-teaming and governance. The framework provides a controlled setting for AI safety, legal NLP, and robustness research across domains such as bankruptcy, patent, and tax proceedings.

Abstract

We present LegalSim, a modular multi-agent simulation of adversarial legal proceedings that explores how AI systems can exploit procedural weaknesses in codified rules. Plaintiff and defendant agents choose from a constrained action space (for example, discovery requests, motions, meet-and-confer, sanctions) governed by a JSON rules engine, while a stochastic judge model with calibrated grant rates, cost allocations, and sanction tendencies resolves outcomes. We compare four policies: PPO, a contextual bandit with an LLM, a direct LLM policy, and a hand-crafted heuristic. Instead of optimizing binary case outcomes, agents are trained and evaluated using effective win rate and a composite exploit score that combines opponent-cost inflation, calendar pressure, settlement pressure at low merit, and a rule-compliance margin. Across configurable regimes (e.g., bankruptcy stays, inter partes review, tax procedures) and heterogeneous judges, we observe emergent "exploit chains", such as cost-inflating discovery sequences and calendar-pressure tactics, that remain procedurally valid yet systemically harmful. Evaluation via cross-play and Bradley-Terry ratings shows that PPO wins most often, the bandit is the most consistently competitive across opponents, the LLM trails both, and the heuristic is weakest. The results are stable across judge settings, and the exploit chains that emerge motivate red-teaming of legal rule systems in addition to model-level testing.
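
To make the rating step concrete, Bradley-Terry strengths can be fit from a cross-play table of pairwise win credits. The following is a minimal sketch, not the paper's implementation. It assumes the standard minorization-maximization update, that each settlement has already been credited as 0.5 to both sides, and that every policy has some positive win credit. The function name `bradley_terry` and the toy counts are hypothetical.

```python
def bradley_terry(wins, iters=500, tol=1e-12):
    """Fit Bradley-Terry strengths with the classic MM update.

    wins[i][j] is the (possibly fractional) win credit of policy i over
    policy j. Assumption: settlements were split 0.5 / 0.5 beforehand.
    Returns strengths normalized to sum to 1.
    """
    k = len(wins)
    # Total win credit per policy, ignoring the diagonal.
    total = [sum(wins[i][j] for j in range(k) if j != i) for i in range(k)]
    p = [1.0 / k] * k
    for _ in range(iters):
        new_p = []
        for i in range(k):
            # Games played against j, weighted by the current pair strength.
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(k)
                if j != i
            )
            new_p.append(total[i] / denom)
        scale = sum(new_p)
        new_p = [x / scale for x in new_p]
        delta = max(abs(a - b) for a, b in zip(new_p, p))
        p = new_p
        if delta < tol:
            break
    return p


# Toy table (invented numbers): policy 0 beats policy 1 in 3 of 4 games.
strengths = bradley_terry([[0.0, 3.0], [1.0, 0.0]])
print(strengths)  # [0.75, 0.25]
```

Under this model, the probability that policy i beats policy j is p_i / (p_i + p_j), so the toy strengths reproduce the observed 3-to-1 record.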

Paper Structure

This paper contains 49 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Win rate by policy & judge (settlement $=0.5$). Bars show mean effective win rate across ten seeds under permissive and strict judges; higher is better for the policy.
  • Figure 2: Cross-play performance heatmaps (role-symmetric). Left: win rate with settlements counted as $0.5$, averaged over both role assignments. Right: composite margin (plaintiff composite $-$ defendant composite) with the sign flipped when roles swap. Rows index the row policy $i$ and columns the opponent policy $j$; numbers are cell means across seeds and judges. Higher (warmer) values indicate that the row policy systematically outperforms (or exerts more procedural pressure than) the column policy.
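
The role-symmetric aggregation described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption, not the authors' code. It assumes that each cell is the equal-weighted mean of the two per-role means and that both roles appear at least once. The record fields (`row_role`, `row_outcome`, `plaintiff_composite`, `defendant_composite`) and the function name are hypothetical.

```python
from statistics import mean

# Effective win credit from the row policy's perspective (settlement = 0.5).
WIN_CREDIT = {"win": 1.0, "settle": 0.5, "loss": 0.0}


def role_symmetric_cell(episodes):
    """Return (win_rate, composite_margin) for one row-vs-column cell.

    Each episode is a dict with hypothetical keys:
      row_role: "plaintiff" or "defendant" (role played by the row policy)
      row_outcome: "win", "settle", or "loss" for the row policy
      plaintiff_composite, defendant_composite: composite scores
    """
    by_role = {"plaintiff": [], "defendant": []}
    for ep in episodes:
        raw_margin = ep["plaintiff_composite"] - ep["defendant_composite"]
        # Flip the sign when the row policy plays defendant, so the margin
        # is always measured from the row policy's side.
        sign = 1.0 if ep["row_role"] == "plaintiff" else -1.0
        by_role[ep["row_role"]].append(
            (WIN_CREDIT[ep["row_outcome"]], sign * raw_margin)
        )
    # Average within each role first, then across the two roles.
    win_rate = mean(mean(w for w, _ in eps) for eps in by_role.values())
    margin = mean(mean(m for _, m in eps) for eps in by_role.values())
    return win_rate, margin


# Invented episodes, for illustration only.
episodes = [
    {"row_role": "plaintiff", "row_outcome": "win",
     "plaintiff_composite": 0.6, "defendant_composite": 0.2},
    {"row_role": "plaintiff", "row_outcome": "settle",
     "plaintiff_composite": 0.3, "defendant_composite": 0.3},
    {"row_role": "defendant", "row_outcome": "loss",
     "plaintiff_composite": 0.5, "defendant_composite": 0.1},
    {"row_role": "defendant", "row_outcome": "win",
     "plaintiff_composite": 0.2, "defendant_composite": 0.4},
]
win_rate, margin = role_symmetric_cell(episodes)
```

Averaging within each role before averaging across roles keeps an unbalanced number of plaintiff and defendant episodes from biasing the cell toward one side.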