EPIC-EuroParl-UdS: Information-Theoretic Perspectives on Translation and Interpreting

Maria Kunilovskaya, Christina Pollkläsener

Abstract

This paper introduces an updated and combined version of the bidirectional English-German EPIC-UdS (spoken) and EuroParl-UdS (written) corpora, which contain original European Parliament speeches together with their translations and interpretations. The new version corrects metadata and text errors identified through previous use, refines the content, updates linguistic annotations, and adds new layers, including word alignment and word-level surprisal indices. The combined resource is designed to support research using information-theoretic approaches to language variation, particularly studies that compare written and spoken modes and examine disfluencies in speech, as well as traditional translationese studies, including parallel (source vs. target) and comparable (original vs. translated) analyses. The paper outlines the updates introduced in this release, summarises previous results based on the corpus, and presents a new illustrative study. The study validates the integrity of the rebuilt spoken data and evaluates probabilistic measures derived from base and fine-tuned GPT-2 and machine translation models on the task of predicting filler particles in interpreting.

Paper Structure (19 sections, 1 equation, 9 figures, 8 tables)

Figures (9)

  • Figure 1: Word surprisals from base monolingual GPT-2 models and base MT model for a segment pair.
  • Figure 2: Structural overview of the EPIC-EuroParl-UdS repository, illustrating the organisation of spoken and written sub-corpora across data formats (long, wide, and vertical) and associated metadata.
  • Figure 3: Word surprisals from base GPT-2 using segment-bounded and sliding window approaches (seg_id: SI_EN_DE_129-27).
  • Figure 4: Relation between segment-bounded and sliding window base GPT-2 word surprisals for written and spoken target subcorpora by language. Lines represent LOWESS-smoothed fits (frac = 0.2).
  • Figure 5: Hugging Face cross-entropy loss on out-of-domain test sets.
  • ...and 4 more figures