Archimedes-AUEB at SemEval-2024 Task 5: LLM explains Civil Procedure
Odysseas S. Chlapanis, Ion Androutsopoulos, Dimitrios Galanis
TL;DR
This work tackles SemEval-2024 Task 5, which requires legal reasoning in Civil Procedure, by moving beyond pure classification to a model that also explains its answers. The authors fine-tune a small student model (Llama-2-7B) on Chain-of-Thought explanations generated by a larger teacher LLM (GPT-3.5), aided by two data-augmentation techniques: Human-Guided Explanations (HGE), which are grounded in expert analyses, and Multiple Choice Mutation (MCM), which creates synthetic data. Results show that both augmentation strategies improve performance: Llama-2-MCM outperforms GPT-3.5-CoT and ranks 15th out of 20 in the competition, while also providing explanations for its predictions. The work demonstrates that grounded, explainable reasoning can be transferred to smaller models. It includes comprehensive ablations and qualitative analyses that illuminate strengths and failure modes, and it offers resources (data, prompts) for future research.
Abstract
The SemEval task on Argument Reasoning in Civil Procedure is challenging in that it requires understanding legal concepts and inferring complex arguments. Currently, most Large Language Models (LLMs) that excel in the legal realm are designed primarily for classification tasks, so the rationale behind their predictions is open to question. The approach we advocate uses a powerful teacher-LLM (ChatGPT) to extend the training dataset with explanations and to generate synthetic data. The resulting data are then used to fine-tune a small student-LLM. Contrary to previous work, our explanations are not directly derived from the teacher's internal knowledge. Instead, they are grounded in authentic human analyses, and therefore deliver a superior reasoning signal. Additionally, a new 'mutation' method generates artificial data instances inspired by existing ones. We are publicly releasing the explanations as an extension to the original dataset, along with the synthetic dataset and the prompts that were used to generate both. Our system ranked 15th in the SemEval competition. It outperforms its own teacher and can produce explanations aligned with the original human analyses, as verified by legal experts.
