A Llama walks into the 'Bar': Efficient Supervised Fine-Tuning for Legal Reasoning in the Multi-state Bar Exam

Rean Fernandes, André Biedenkapp, Frank Hutter, Noor Awad

TL;DR

This work asks whether smaller open-weight LLMs can reason competitively on the Multi-state Bar Examination (MBE) when fine-tuned on limited data. By distilling explanations into the IRAC (Issue, Rule, Application, Conclusion) format and applying supervised fine-tuning (SFT) with QLoRA adapters to Llama-2 7B and Llama-3 8B, the authors show notable gains in accuracy and parsing reliability, though the models do not reach GPT-4-level performance. Structured IRAC explanations yield stronger gains for Llama-3 and markedly reduce parsing failures, while Llama-2 benefits more modestly and requires more data. The study releases the curated SFT dataset and adapter family, establishing a practical lower bound on the resources needed for legal QA and highlighting the trade-offs between model size, data, and reasoning structure in domain-specific fine-tuning.

Abstract

Legal reasoning tasks present unique challenges for large language models (LLMs) due to the complexity of domain-specific knowledge and reasoning processes. This paper investigates how effectively smaller language models (Llama 2 7B and Llama 3 8B) can be fine-tuned with a limited dataset of 1,514 Multi-state Bar Examination (MBE) questions to improve legal question-answering accuracy. We evaluate these models on the 2022 MBE questions licensed from JD Advising, the same dataset used in the 'GPT-4 passes the Bar exam' study. Our methodology involves collecting approximately 200 questions per legal domain across 7 domains. We distill the dataset using Llama 3 (70B) to transform explanations into a structured IRAC (Issue, Rule, Application, Conclusion) format as a guided reasoning process, and test whether this improves performance over the non-distilled dataset. We compare the non-fine-tuned models against their supervised fine-tuned (SFT) counterparts, trained with different sample sizes per domain, to study the effect on accuracy and prompt adherence. We also analyse option selection biases and their mitigation following SFT. In addition, we consolidate the performance across multiple variables: prompt type (few-shot vs zero-shot), answer ordering (chosen-option first vs generated-explanation first), response format (numbered list vs Markdown vs JSON), and different decoding temperatures. Our findings show that domain-specific SFT helps some model configurations achieve close to human baseline performance, despite limited computational resources and a relatively small dataset. We release both the gathered SFT dataset and the family of SFT adapters optimised for MBE performance. This establishes a practical lower bound on the resources needed to achieve effective legal question answering in smaller LLMs.
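The abstract lists four evaluation variables that are crossed with one another. As a rough illustration of how such a grid could be enumerated, the sketch below builds every combination with the standard library. The variable names and, in particular, the temperature values are assumptions made for illustration; the abstract does not state which temperatures were used.

```python
from itertools import product

# Levels named in the abstract.
PROMPT_TYPES = ("zero-shot", "few-shot")
ANSWER_ORDERS = ("option-first", "explanation-first")
RESPONSE_FORMATS = ("numbered-list", "markdown", "json")

# Hypothetical temperatures: the abstract only says "different decoding
# temperatures", so these values are placeholders.
TEMPERATURES = (0.0, 0.5, 1.0)


def build_grid(temperatures=TEMPERATURES):
    """Return one config dict per combination of the four variables."""
    return [
        {
            "prompt_type": prompt_type,
            "answer_order": answer_order,
            "response_format": response_format,
            "temperature": temperature,
        }
        for prompt_type, answer_order, response_format, temperature in product(
            PROMPT_TYPES, ANSWER_ORDERS, RESPONSE_FORMATS, temperatures
        )
    ]


if __name__ == "__main__":
    grid = build_grid()
    print(len(grid))  # 2 * 2 * 3 * 3 = 36 configurations
```

Each resulting dictionary could then be passed to whatever evaluation routine scores a model on the test questions, which keeps the bookkeeping for the different settings in one place.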

Paper Structure

This paper contains 56 sections, 1 equation, 12 figures, 10 tables.

Figures (12)

  • Figure 1: The learning-curve panel compares Llama 2 and Llama 3 model performance as a function of training samples. The parsing-failure panel shows the reduction in parsing failures with increased fine-tuning samples, demonstrating that models rapidly adapt to the required response format, even with minimal training.
  • Figure 2: Llama 3 benefits from the structuring of the dataset whereas Llama 2 shows no clear benefit from the structuring.
  • Figure 3: As the number of training samples increases, the bias towards specific options (C for Llama 2 and D for Llama 3) decreases, indicating that SFT mitigates the bias.
  • Figure 4: Extraction methodology using text processing to consolidate the questions and the solutions into one structure, from which individual elements can be queried. The process is the same for both the test and train datasets.
  • Figure 5: Data distillation done using Llama 3 70B to restructure the explanation into IRAC format.
  • ...and 7 more figures
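
The captions for Figures 1 and 3 refer to two quantities: how often a response cannot be parsed, and how strongly a model favours one answer option. The sketch below shows one way these could be computed from raw model responses. It is not the authors' parser: the regular expression, the function names, and the sample responses are all hypothetical, and real responses in Markdown or JSON format would need format-specific handling.

```python
import re
from collections import Counter
from typing import Optional

OPTIONS = ("A", "B", "C", "D")

# Hypothetical pattern: the keyword "answer" or "option" (any case), an
# optional "is", an optional colon or dash, then a single capital letter A-D
# that is not the start of a longer word.
_CHOICE_RE = re.compile(
    r"(?i:answer|option)(?i:\s+is)?\s*[:\-]?\s*\(?([ABCD])\)?(?![A-Za-z])"
)


def parse_choice(response: str) -> Optional[str]:
    """Return the chosen option letter, or None if parsing fails."""
    match = _CHOICE_RE.search(response)
    return match.group(1) if match else None


def summarize(responses):
    """Return (share of each option among parsed answers, parse-failure rate)."""
    choices = [parse_choice(r) for r in responses]
    parsed = [c for c in choices if c is not None]
    counts = Counter(parsed)
    shares = {o: (counts[o] / len(parsed) if parsed else 0.0) for o in OPTIONS}
    failure_rate = (len(choices) - len(parsed)) / len(choices) if choices else 0.0
    return shares, failure_rate


if __name__ == "__main__":
    sample = ["Answer: C", "Answer: C", "The answer is (D).", "I cannot decide."]
    shares, failure_rate = summarize(sample)
    print(shares, failure_rate)
```

If the correct answers are roughly balanced across the four options, a share far above 0.25 for one letter would suggest a selection bias of the kind the Figure 3 caption describes; comparing these shares across adapters trained on different sample sizes would show whether the bias shrinks.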