Evaluating LLM-based Approaches to Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, or RAG? A Benchmark and an Australian Law Case Study

Jiuzhou Han, Paul Burgess, Ehsan Shareghi

TL;DR

This paper introduces the AusLaw Citation Benchmark, a large-scale Australian legal citation dataset with 55,005 instances and 18,677 unique citations, plus a learned Reason-of-Citation (RoC) for each reference. It evaluates a broad spectrum of approaches (prompting general-purpose and law-specialised LLMs, retrieval-only with domain-specific embeddings, instruction-tuned LLMs, and various hybrids such as query expansion, voting ensembles, RAG, and re-rankers) under open-world and closed-world settings. The key finding is that instruction tuning on task-specific data delivers the strongest gains, while pre-training alone (even on law data) is insufficient; retrieval quality, especially RoC-based index granularity, and robust re-ranking are critical to approaching the performance ceiling, though a gap of roughly 50% remains. The benchmark and findings provide a rigorous framework to advance legal-domain AI, with practical implications for jurisdiction-aware retrieval, citation auditing, and AI-assisted litigation research.

Abstract

Large Language Models (LLMs) have demonstrated strong potential across legal tasks, yet the problem of legal citation prediction remains under-explored. At its core, this task demands fine-grained contextual understanding and precise identification of relevant legislation or precedent. We introduce the AusLaw Citation Benchmark, a real-world dataset comprising 55k Australian legal instances and 18,677 unique citations, which to the best of our knowledge is the first of its scale and scope. We then conduct a systematic benchmarking across a range of solutions: (i) standard prompting of both general and law-specialised LLMs, (ii) retrieval-only pipelines with both generic and domain-specific embeddings, (iii) supervised fine-tuning, and (iv) several hybrid strategies that combine LLMs with retrieval augmentation through query expansion, voting ensembles, or re-ranking. Results show that neither general nor law-specific LLMs suffice as stand-alone solutions, with performance near zero. Instruction tuning (of even a generic open-source LLM) on a task-specific dataset is among the best-performing solutions. We highlight that database granularity, along with the type of embeddings, plays a critical role in retrieval-based approaches, with hybrid methods that utilise a trained re-ranker delivering the best results. Despite this, a performance gap of nearly 50% remains, underscoring the value of this challenging benchmark as a rigorous test-bed for future research in the legal domain.

Paper Structure

This paper contains 34 sections, 1 equation, 3 figures, 10 tables.

Figures (3)

  • Figure 1: Average Accuracy and Number of Unique Cases for various citation frequency buckets. Accuracies per bin are based on the following settings from Table \ref{tab:main}: Cite-SaulLM-7B (ACC@1: 51.7), RAG (ACC@1: 42.9), and Re-ranker (ACC@1: 52.1), and from Table \ref{tab:pre-trainings}: Cite-AusLawLLM-7B (ACC@1: 52.0).
  • Figure 2: Examples of Catchwords from different cases in NSW Caselaw.
  • Figure 3: (Top) Frequency distribution of unique cases in the data. The red vertical dashed line marks the last case with citation frequency of 9 or higher. (Bottom) Top-20 most frequently cited cases in the data.
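Figure 1 reports ACC@1 broken down by citation-frequency bucket. As a rough illustration of what such a breakdown involves, the sketch below computes top-1 accuracy per bucket. It is not the authors' evaluation code: the function names, the bucket edges, and the choice to count citation frequency over the gold labels of the evaluated set are all assumptions made for the example.

```python
from collections import Counter


def bucket_label(count, edges):
    """Return a label for the first upper edge that covers `count` (hypothetical helper)."""
    for edge in edges:
        if count <= edge:
            return f"<={edge}"
    return f">{edges[-1]}"


def acc_at_1_by_bucket(gold, predicted, edges=(1, 9)):
    """Top-1 accuracy per citation-frequency bucket.

    gold      : list of gold citation strings, one per instance
    predicted : list of top-1 predicted citation strings, same length
    edges     : ascending upper bounds of the buckets (assumed values)
    """
    # Assumption: a citation's frequency is how often it appears in `gold`.
    freq = Counter(gold)
    hits, totals = Counter(), Counter()
    for g, p in zip(gold, predicted):
        label = bucket_label(freq[g], edges)
        totals[label] += 1
        hits[label] += int(g == p)  # exact-match scoring
    return {label: hits[label] / totals[label] for label in totals}


if __name__ == "__main__":
    gold = ["A", "A", "B", "C"]
    predicted = ["A", "B", "B", "X"]
    print(acc_at_1_by_bucket(gold, predicted))
```

Exact string matching is the simplest scoring choice; any normalisation of citation strings before comparison would be a further assumption.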