Evaluating LLM-based Approaches to Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, or RAG? A Benchmark and an Australian Law Case Study
Jiuzhou Han, Paul Burgess, Ehsan Shareghi
TL;DR
This paper introduces the AusLaw Citation Benchmark, a large-scale Australian legal citation dataset with 55,005 instances and 18,677 unique citations, plus a learned Reason-of-Citation (RoC) for each reference. It evaluates a broad spectrum of approaches under open-world and closed-world settings: prompting general-purpose and law-specialised LLMs, retrieval-only with domain-specific embeddings, instruction-tuned LLMs, and various hybrids (query expansion, voting ensembles, RAG, and re-rankers). The key finding is that instruction tuning on task-specific data delivers the strongest gains, while pre-training alone (even on law data) is insufficient. Retrieval quality, especially RoC-based index granularity, and robust re-ranking are critical to approaching the performance ceiling, though a gap of nearly 50% remains. The benchmark and findings provide a rigorous framework to advance legal-domain AI, with practical implications for jurisdiction-aware retrieval, citation auditing, and AI-assisted litigation research.
Abstract
Large Language Models (LLMs) have demonstrated strong potential across legal tasks, yet the problem of legal citation prediction remains under-explored. At its core, this task demands fine-grained contextual understanding and precise identification of relevant legislation or precedent. We introduce the AusLaw Citation Benchmark, a real-world dataset comprising 55k Australian legal instances and 18,677 unique citations, which, to the best of our knowledge, is the first of its scale and scope. We then conduct systematic benchmarking across a range of solutions: (i) standard prompting of both general and law-specialised LLMs, (ii) retrieval-only pipelines with both generic and domain-specific embeddings, (iii) supervised fine-tuning, and (iv) several hybrid strategies that combine LLMs with retrieval augmentation through query expansion, voting ensembles, or re-ranking. Results show that neither general nor law-specific LLMs suffice as stand-alone solutions, with performance near zero. Instruction tuning (of even a generic open-source LLM) on a task-specific dataset is among the best-performing solutions. We highlight that database granularity and the type of embeddings play a critical role in retrieval-based approaches, with hybrid methods that utilise a trained re-ranker delivering the best results. Despite this, a performance gap of nearly 50% remains, underscoring the value of this challenging benchmark as a rigorous test-bed for future research in the legal domain.
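To make the retrieval-only and voting-ensemble ideas named above more concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a toy index that maps hypothetical citation identifiers (`CASE_A`, `CASE_B`, `CASE_C`) to short descriptive texts, uses bag-of-words cosine similarity as a stand-in for the generic or domain-specific embeddings discussed in the paper, and combines several candidate lists with a simple Borda-style vote. The function names and the index contents are illustrative assumptions only.

```python
import math
from collections import Counter

# Hypothetical toy index: citation id -> descriptive text (assumption, not benchmark data).
INDEX = {
    "CASE_A": "duty of care in negligence claims",
    "CASE_B": "unfair dismissal under employment law",
    "CASE_C": "breach of contract and damages",
}

def bow(text):
    """Bag-of-words vector; a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, index, k=2):
    """Return the k citation ids whose descriptions are most similar to the query."""
    q = bow(query)
    ranked = sorted(index, key=lambda cid: cosine(q, bow(index[cid])), reverse=True)
    return ranked[:k]

def vote(candidate_lists):
    """Borda-style vote: an item at rank r in a list of length n earns n - r points."""
    scores = Counter()
    for candidates in candidate_lists:
        for rank, cid in enumerate(candidates):
            scores[cid] += len(candidates) - rank
    # Sort by descending score, then by id so that ties are deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

if __name__ == "__main__":
    top = retrieve("negligence duty of care breach", INDEX)
    print(top)
    # Combine the retriever's list with two other hypothetical systems' outputs.
    print(vote([top, ["CASE_C", "CASE_A"], ["CASE_A", "CASE_B"]]))
```

In a realistic setting the `bow` function would be replaced by an embedding model and the voted list could be passed to a trained re-ranker, which the abstract reports as the best-performing hybrid configuration.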
