Towards Trustworthy Legal AI through LLM Agents and Formal Reasoning
Linze Chen, Yufan Cai, Zhe Hou, Jinsong Dong
TL;DR
This work presents L4M, a neural-symbolic framework that unites adversarial LLM agents with SMT-based formal verification to produce auditable legal judgments. By formalizing statutes into a unified meta-schema, instantiating articles and clauses, and enforcing consistency with a solver, L4M achieves interpretable, verifiable verdicts while outperforming strong legal LLM baselines on accuracy and robustness. The pipeline incorporates dual fact–statute extractors, autoformalization, iterative critique, and a Judge LLM to translate solver proofs into plain-language justifications, enabling step-by-step auditability. The experiments on LeCaRDv2 and perturbation datasets demonstrate improved statute grounding, reduced sentencing error, and higher resilience to factual variations, highlighting the practical impact of integrating symbolic reasoning into legal AI. Overall, the paper argues that formal grounding and solver-backed verification are essential for trustworthy AI in law, with promising implications for real-world governance and accountability.
Abstract
The rationality of law manifests in two forms: substantive rationality, which concerns the fairness or moral desirability of outcomes, and formal rationality, which requires legal decisions to follow explicitly stated, general, and logically coherent rules. Existing LLM-based systems excel at surface-level text analysis but lack the guarantees required for principled jurisprudence. We introduce L4M, a novel framework that combines adversarial LLM agents with SMT-solver-backed proofs to unite the interpretive flexibility of natural language with the rigor of symbolic verification. The pipeline consists of three phases: (1) Statute Formalization, where domain-specific prompts convert legal provisions into logical formulae; (2) Dual Fact and Statute Extraction, in which prosecutor- and defense-aligned LLMs independently map case narratives to fact tuples and statutes, ensuring role isolation; and (3) Solver-Centric Adjudication, where an autoformalizer compiles both parties' arguments into logic constraints, and unsat cores trigger iterative self-critique until a satisfiable formula is achieved, which is then verbalized by a Judge-LLM into a transparent verdict and optimized sentence. Experimental results on public benchmarks show that our system surpasses advanced LLMs including GPT-o4-mini, DeepSeek-V3, and Claude 4 as well as state-of-the-art Legal AI baselines, while providing rigorous and explainable symbolic justifications.
