Dataset of Quotation Attribution in German News Articles

Fynn Petersen-Frey, Chris Biemann

TL;DR

The paper addresses the lack of annotated data for quotation attribution in German news by releasing a freely available, Creative Commons-licensed dataset of 1000 German WIKINEWS articles, annotated with a fine-grained schema covering Speaker, Cue, Frame, Addressee, Quotation Type, and Quote Media. It describes the annotation schema and the data creation workflow, then evaluates two existing systems (a rule-based system and Citron) to establish reference performance and to motivate machine learning models that handle multi-span, cross-sentence quotations. The quantitative analysis finds a long-tailed distribution of quotation types, quotes that span multiple sentences, and complex role attachments. The baselines lean toward precision and reach relatively low recall, which underscores the dataset’s value for developing higher-recall systems. Overall, the resource supports training and evaluating German quotation attribution systems for downstream applications in digital humanities and computational social science.

Abstract

Extracting who says what to whom is a crucial part of analyzing human communication in today's abundance of data such as online news articles. Yet, the lack of annotated data for this task in German news articles severely limits the quality and usability of possible systems. To remedy this, we present a new, freely available, Creative Commons-licensed dataset for quotation attribution in German news articles based on WIKINEWS. The dataset provides curated, high-quality annotations across 1000 documents (250,000 tokens) in a fine-grained annotation schema that enables various downstream uses. The annotations specify not only who said what but also how, in which context, and to whom, and they define the type of quotation. We specify our annotation schema, describe the creation of the dataset, and provide a quantitative analysis. Further, we describe suitable evaluation metrics, apply two existing systems for quotation attribution, discuss their results to evaluate the utility of our dataset, and outline use cases of our dataset in downstream tasks.

Paper Structure (40 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Quote token length histograms
  • Figure 2: Role token length histograms
  • Figure 3: Number of Quotes per role span

Theorems & Definitions (5)

  • Example 1.1: Direct
  • Example 1.2: Indirect
  • Example 1.3: Reported
  • Example 1.4: Free Indirect
  • Example 1.5: Indirect/Free Indirect
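
To make the schema more concrete, the roles named in the TL;DR and the quotation types listed above could be held in a record like the one below. This is only a minimal sketch and not the dataset's actual release format: the class and field names, the half-open token-offset convention for spans, and the representation of the quote medium as a plain string are all assumptions made for illustration. Each role is stored as a list of spans because the summary notes that quotations can be multi-span and cross sentence boundaries.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple

# Assumption: a span is a half-open token range (start, end) in one document.
Span = Tuple[int, int]


class QuotationType(Enum):
    """Quotation types as named in the example list above."""
    DIRECT = "Direct"
    INDIRECT = "Indirect"
    REPORTED = "Reported"
    FREE_INDIRECT = "Free Indirect"
    INDIRECT_FREE_INDIRECT = "Indirect/Free Indirect"


@dataclass
class Quotation:
    """Hypothetical container for one annotated quotation and its roles."""
    quote_spans: List[Span]                  # one or more spans of quoted content
    quotation_type: QuotationType
    speaker: List[Span] = field(default_factory=list)
    cue: List[Span] = field(default_factory=list)
    frame: List[Span] = field(default_factory=list)
    addressee: List[Span] = field(default_factory=list)
    medium: Optional[str] = None             # assumed free-text label

    def token_length(self) -> int:
        """Total number of tokens covered by all quote spans."""
        return sum(end - start for start, end in self.quote_spans)

    def is_multi_span(self) -> bool:
        """True if the quoted content is split into several spans."""
        return len(self.quote_spans) > 1


# Usage: a direct quote interrupted by a cue and a speaker mention.
quote = Quotation(
    quote_spans=[(0, 5), (9, 14)],
    quotation_type=QuotationType.DIRECT,
    cue=[(5, 6)],
    speaker=[(6, 8)],
)
print(quote.token_length(), quote.is_multi_span())
```

Keeping every role as a list of spans, rather than a single span, means the same structure covers both simple one-sentence quotes and quotes whose content or roles are spread over several places in the text.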