Table of Contents
Fetching ...

L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Ziqi Wang, Boqin Yuan

Abstract

We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence via agentic web search, filters results through a verification agent, and synthesizes cited answers. Existing legal QA benchmarks test either closed-book reasoning or retrieval over fixed corpora, but neither captures scenarios requiring current legal information. We introduce LegalSearchQA, a 50-question benchmark across five legal domains whose answers depend on recent developments that post-date model training data. L-MARS achieves 96.0% accuracy on LegalSearchQA, a 38.0% improvement over zero-shot performance (58.0%), while chain-of-thought prompting degrades performance to 30.0%. On Bar Exam QA (Zheng et al., 2025), a reasoning-focused benchmark of 594 bar examination questions, retrieval provides negligible gains (+0.7 percentage points), consistent with prior findings. These results show that agentic retrieval dramatically improves legal QA when tasks require up-to-date factual knowledge, but the benefit is benchmark-dependent, underscoring the need for retrieval-focused evaluation. Code and data are available at: https://github.com/boqiny/L-MARS

L-MARS: Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search

Abstract

We present L-MARS (Legal Multi-Agent Workflow with Orchestrated Reasoning and Agentic Search), a multi-agent retrieval framework for grounded legal question answering that decomposes queries into structured sub-problems, retrieves evidence via agentic web search, filters results through a verification agent, and synthesizes cited answers. Existing legal QA benchmarks test either closed-book reasoning or retrieval over fixed corpora, but neither captures scenarios requiring current legal information. We introduce LegalSearchQA, a 50-question benchmark across five legal domains whose answers depend on recent developments that post-date model training data. L-MARS achieves 96.0% accuracy on LegalSearchQA, a 38.0% improvement over zero-shot performance (58.0%), while chain-of-thought prompting degrades performance to 30.0%. On Bar Exam QA (Zheng et al., 2025), a reasoning-focused benchmark of 594 bar examination questions, retrieval provides negligible gains (+0.7 percentage points), consistent with prior findings. These results show that agentic retrieval dramatically improves legal QA when tasks require up-to-date factual knowledge, but the benefit is benchmark-dependent, underscoring the need for retrieval-focused evaluation. Code and data are available at: https://github.com/boqiny/L-MARS

Paper Structure

This paper contains 36 sections, 1 equation, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of the two L-MARS operating modes. (a) Simple Mode: executes a single-pass retrieval–summarization pipeline. (b) Multi-Turn Mode: adds Judge Agent–guided iterations with sufficiency checks and query refinement.
  • Figure 2: Per-domain accuracy on LegalSearchQA. L-MARS achieves near-perfect accuracy across all domains. CoT falls below the 25% random baseline in three of five domains.
  • Figure 3: Benchmark comparison. On retrieval-dependent LegalSearchQA, L-MARS gains 38 percentage points over zero-shot. On reasoning-focused Bar Exam QA, retrieval provides negligible improvement.