CitaLaw: Enhancing LLM with Citations in Legal Domain

Kepu Zhang, Weijie Yu, Sunhao Dai, Jun Xu

TL;DR

CitaLaw introduces a legal-citation benchmark that evaluates LLMs on generating legally grounded responses with context-aware citations, using two audience-specific subsets (laypersons and practitioners). It couples a retrieval-augmented generation framework with a novel syllogism-based evaluation that links major premises (law articles or precedents), minor premises (case circumstances), and conclusions (legal decisions). The corpus combines law articles and precedents (≈500k documents) and supports two generation strategies, CGG and ARG. The experiments show that explicit references improve response quality and that syllogism-based metrics align with human judgments. The work provides practical guidance for deploying trustworthy legal LLMs and highlights performance differences between open-domain and legal-domain models across retrieval and NLI configurations.

Abstract

In this paper, we propose CitaLaw, the first benchmark designed to evaluate LLMs' ability to produce legally sound responses with appropriate citations. CitaLaw features a diverse set of legal questions for both laypersons and practitioners, paired with a comprehensive corpus of law articles and precedent cases as a reference pool. This framework enables LLM-based systems to retrieve supporting citations from the reference corpus and align these citations with the corresponding sentences in their responses. Moreover, we introduce syllogism-inspired evaluation methods to assess the legal alignment between retrieved references and LLM-generated responses, as well as their consistency with user questions. Extensive experiments on 2 open-domain and 7 legal-specific LLMs demonstrate that integrating legal references substantially enhances response quality. Furthermore, our proposed syllogism-based evaluation method exhibits strong agreement with human judgments.
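
The retrieve-then-align idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the token-overlap scorer stands in for the retrieval models the paper compares, the two-entry corpus is invented placeholder text, and every function name (`retrieve`, `attach_citations`, and so on) is hypothetical. The `[A…]` and `[C…]` tags follow the notation of Figure 2, where [A] marks a law article and [C] a precedent case.

```python
import re
from collections import Counter

# Invented placeholder corpus (NOT real statutes or cases).
# "A" ids mark law articles, "C" ids mark precedent cases, as in Figure 2.
CORPUS = [
    {"id": "A1", "text": "A tenant who damages the rented apartment must compensate the landlord."},
    {"id": "C1", "text": "The court ordered the tenant to pay repair costs after the apartment flooded."},
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def overlap_score(query, doc):
    """Count shared tokens (with multiplicity); a toy stand-in for a real retriever."""
    q, d = Counter(tokenize(query)), Counter(tokenize(doc))
    return sum((q & d).values())

def retrieve(query, corpus, k=1):
    """Return the k references with the highest overlap score."""
    ranked = sorted(corpus, key=lambda ref: overlap_score(query, ref["text"]), reverse=True)
    return ranked[:k]

def split_sentences(response):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]

def attach_citations(response, corpus):
    """Append the best-matching reference id to each sentence that has any overlap."""
    cited = []
    for sentence in split_sentences(response):
        best = retrieve(sentence, corpus, k=1)[0]
        if overlap_score(sentence, best["text"]) > 0:
            cited.append(f"{sentence} [{best['id']}]")
        else:
            cited.append(sentence)  # no supporting reference found
    return " ".join(cited)

answer = "The tenant must compensate the landlord for damage. Thank you!"
print(attach_citations(answer, CORPUS))
```

In this toy setup the first sentence picks up the law-article tag and the closing pleasantry stays uncited. A realistic system would replace `overlap_score` with a sparse or dense retriever and would need a stricter test of whether a reference actually supports a sentence, which is the role the paper assigns to its NLI-based and syllogism-based evaluation.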

Paper Structure

This paper contains 31 sections, 5 equations, 6 figures, and 8 tables.

Figures (6)

  • Figure 1: The framework of our CitaLaw.
  • Figure 2: Examples from the two subsets of CitaLaw, with text in red, blue, and yellow representing the three dimensions of the syllogism: major premise, minor premise (circumstances, illegal acts), and conclusion (legal decisions), respectively. [A] and [C] denote citations to relevant law articles and precedent cases, respectively.
  • Figure 3: Performance of different retrieval models. Lay is short for Layperson dataset and Pra is short for Practitioner dataset.
  • Figure 4: The performance of different NLI models when the LLM is Llama.
  • Figure 5: Prompts used in this paper. (a) The prompt $p_1$ is used to retrieve one law article in the Layperson dataset. (b) The prompt $p_1$ is used to retrieve one law article and three precedent cases in the Practitioner dataset. (c) The prompt $p_3$ is used to refine the LLM's answer based on references. (d) The prompt $p_2$ is used for LLM responses without references.
  • ...and 1 more figure