On the Suitability of pre-trained foundational LLMs for Analysis in German Legal Education
Lorenz Wendlinger, Christian Braun, Abdullah Al Zubaer, Simon Alexander Nonn, Sarah Großkopf, Christofer Fellicious, Michael Granitzer
TL;DR
This study evaluates open-source foundational LLMs for German legal education, focusing on the classification of "Gutachtenstil" appraisal style components, argument mining, and automated essay scoring. It compares GPT-3.5, Llama 3, Mixtral, and Jina Embeddings across four datasets, using prompting strategies such as Retrieval Augmented Generation (RAG) and Chain-of-Thought to probe zero-shot and few-shot performance. The key finding is that pre-trained LLMs have German legal background knowledge but struggle with highly specific tasks, such as identifying Gutachtenstil components, and with complex contexts, such as complete legal opinions. RAG-based example selection and few-shot prompting can improve performance when enough labeled data is available. The results suggest practical utility for simpler tasks and low-data educational settings, while underscoring limitations in language-specific reasoning, efficiency, and cross-domain transfer that should inform future research on task-focused LLM deployment in legal education.
Abstract
We show that current open-source foundational LLMs possess instruction-following capability and German legal background knowledge sufficient for some legal analysis in an educational context. However, model capability breaks down on very specific tasks, such as the classification of "Gutachtenstil" appraisal style components, and with complex contexts, such as complete legal opinions. Even with extended context and effective prompting strategies, the models cannot match the Bag-of-Words baseline on these tasks. To address this, we introduce a Retrieval Augmented Generation (RAG) based prompt example selection method that substantially improves predictions in high data availability scenarios. We further evaluate pre-trained LLMs on two standard tasks, argument mining and automated essay scoring, and find their performance there to be more adequate. Throughout, pre-trained LLMs improve upon the baseline in scenarios with little or no labeled data, with Chain-of-Thought prompting further helping in the zero-shot case.
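The idea behind retrieval-based prompt example selection can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a simple bag-of-words cosine similarity in place of a learned embedding model, and the label names, example sentences, prompt wording, and function names (`select_examples`, `build_prompt`) are all hypothetical. For a query sentence, the most similar labeled training examples are retrieved and placed in the prompt as few-shot demonstrations.

```python
import math
import re
from collections import Counter

# Hypothetical toy pool of labeled sentences (not taken from the paper's data).
POOL = [
    ("A könnte sich wegen Diebstahls strafbar gemacht haben.", "Obersatz"),
    ("Wegnahme ist der Bruch fremden Gewahrsams.", "Definition"),
    ("A hat den Gewahrsam des B gebrochen.", "Subsumtion"),
    ("A hat sich wegen Diebstahls strafbar gemacht.", "Ergebnis"),
]


def bow(text):
    """Lowercased term counts; an assumed stand-in for a real embedding."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_examples(query, pool, k=2):
    """Return the k labeled examples most similar to the query."""
    q = bow(query)
    ranked = sorted(pool, key=lambda ex: cosine(q, bow(ex[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, pool, k=2):
    """Assemble a few-shot prompt from the retrieved examples."""
    parts = ["Classify the sentence into one Gutachtenstil component."]
    for text, label in select_examples(query, pool, k):
        parts.append(f"Sentence: {text}\nLabel: {label}")
    parts.append(f"Sentence: {query}\nLabel:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("B könnte sich wegen Betrugs strafbar gemacht haben.", POOL))
```

Because the demonstrations are drawn from labeled data, this kind of selection is only useful when a reasonably large annotated pool exists, which is consistent with the abstract's restriction to high data availability scenarios.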
