CourtPressGER: A German Court Decision to Press Release Summarization Dataset

Sebastian Nagl, Mohamed Elganayni, Melanie Pospisil, Matthias Grabmair

TL;DR

This work introduces CourtPressGER, a large German dataset and evaluation framework for summarizing court decisions as press releases, bridging the gap between technical judicial summaries and citizen-oriented communication. It provides 6.4k aligned triples of rulings, official press releases, and synthetic prompts, enabling benchmarking of multiple LLMs in both hierarchical and full-document settings. The study assesses performance with automatic metrics, factual-consistency checks, LLM-based judgments, and human evaluation. It finds that larger models excel and that hierarchical summarization helps smaller models cope with long documents, though factual consistency remains challenging. Human-drafted press releases still rank highest with experts, underscoring the need for better automatic and human-in-the-loop evaluation in German legal summarization. The dataset and framework offer a valuable resource for advancing transparent, accessible court communication in German, a step toward broader public understanding of judicial decisions.

Abstract

Official court press releases from Germany's highest courts present and explain judicial rulings to the public as well as to expert audiences. Prior NLP efforts emphasize technical headnotes and ignore citizen-oriented communication needs. We introduce CourtPressGER, a dataset of 6.4k triples, each consisting of a ruling, a human-drafted press release, and a synthetic prompt for LLMs to generate a comparable release. This benchmark supports training and evaluating LLMs to generate accurate, readable summaries of long judicial texts. We benchmark small and large LLMs using reference-based metrics, factual-consistency checks, LLM-as-judge, and expert ranking. Large LLMs produce high-quality drafts and lose little performance in the hierarchical setting, while smaller models require hierarchical setups to handle long judgments. Initial benchmarks show varying model performance, with human-drafted releases ranking highest.
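The abstract contrasts full-document and hierarchical summarization of long judgments. As a rough illustration of the generic hierarchical (map-then-reduce) pattern, the sketch below splits a long text into character-bounded chunks, summarizes each chunk, and then recursively summarizes the concatenated partial summaries until the input fits the budget. The character budget `max_chars`, the paragraph-based chunking, and the hypothetical `toy_summarize` stand-in for an LLM call are assumptions for illustration only; this is not the authors' pipeline.

```python
from typing import Callable, List


def chunk_text(text: str, max_chars: int) -> List[str]:
    """Split text into chunks of at most max_chars, preferring paragraph boundaries."""
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        # Hard-split any single paragraph that exceeds the budget on its own.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def hierarchical_summarize(
    text: str, summarize: Callable[[str], str], max_chars: int = 4000
) -> str:
    """Map: summarize each chunk. Reduce: recurse on the joined partial summaries."""
    # Base case: the text already fits the (assumed) model input budget.
    if len(text) <= max_chars:
        return summarize(text)
    partial = [summarize(chunk) for chunk in chunk_text(text, max_chars)]
    merged = "\n\n".join(partial)
    # Guard against non-termination if the summarizer does not compress its input.
    if len(merged) >= len(text):
        raise ValueError("chunk summaries did not shrink the input")
    return hierarchical_summarize(merged, summarize, max_chars)


def toy_summarize(s: str) -> str:
    # Hypothetical stand-in for an LLM call: keep the first 40 characters.
    return s[:40]


doc = "\n\n".join(f"Paragraph {i}: " + "x" * 300 for i in range(20))
summary = hierarchical_summarize(doc, toy_summarize, max_chars=1000)
print(len(summary))  # 40
```

In a real setting, `summarize` would wrap a model call and `max_chars` would be replaced by a token budget derived from the model's context window; a full-document setup corresponds to the base case being reached immediately.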
