PolInterviews -- A Dataset of German Politician Public Broadcast Interviews
Lukas Birkenmaier, Laureen Sieber, Felix Bergstein
TL;DR
High-quality German-language transcripts of political interviews suitable for computational analysis are scarce. The paper introduces PolInterviews, a curated dataset of 99 public-broadcast interviews with 33 German politicians (2020–2024) across five interview formats, totaling 28,146 sentences, each transcribed and labeled for speaker identity. The pipeline combines Whisper-based transcription, ECAPA-TDNN speaker diarization with agglomerative clustering, and manual validation to produce timestamped, speaker-tagged transcripts. Key contributions are the open, tidy data format and a detailed variable schema. The dataset supports research on agenda-setting, self-presentation, and interviewer–interviewee dynamics in German political communication, and it is structured so that it can be integrated with other datasets.
Abstract
This paper presents a novel dataset of public broadcast interviews featuring high-ranking German politicians. The interviews were sourced from YouTube, transcribed, processed for speaker identification, and stored in a tidy and open format. The dataset comprises 99 interviews with 33 different German politicians across five major interview formats, containing a total of 28,146 sentences. As the first of its kind, this dataset offers valuable opportunities for research on various aspects of political communication in the German political context, such as agenda-setting, interviewer dynamics, or politicians' self-presentation.
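To illustrate what a tidy, sentence-level format makes possible, the following is a minimal sketch assuming one row per sentence with a speaker label and timestamps. The column names (`interview_id`, `sentence_id`, `speaker_role`, `start`, `end`, `text`) and the sample rows are hypothetical and invented for illustration; they are not taken from the released dataset or its variable schema.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt: column names and rows are invented for illustration
# and do not reflect the actual PolInterviews variable schema.
SAMPLE = """interview_id,sentence_id,speaker_role,start,end,text
int_001,1,interviewer,0.0,3.2,Guten Abend und herzlich willkommen.
int_001,2,politician,3.4,9.8,Vielen Dank.
int_001,3,politician,9.9,15.1,Wir haben in diesem Jahr viel erreicht.
int_001,4,interviewer,15.3,18.0,Was genau meinen Sie damit?
"""


def load_rows(csv_text):
    """Parse tidy CSV text into a list of dicts with typed numeric fields."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = []
    for row in reader:
        row["sentence_id"] = int(row["sentence_id"])
        row["start"] = float(row["start"])
        row["end"] = float(row["end"])
        rows.append(row)
    return rows


def sentence_counts(rows):
    """Count sentences per speaker role."""
    return Counter(row["speaker_role"] for row in rows)


def speaking_time(rows):
    """Sum sentence durations (end - start, in seconds) per speaker role."""
    totals = Counter()
    for row in rows:
        totals[row["speaker_role"]] += row["end"] - row["start"]
    return dict(totals)


rows = load_rows(SAMPLE)
counts = sentence_counts(rows)
times = speaking_time(rows)
print(counts)
print({role: round(seconds, 1) for role, seconds in times.items()})
```

Because each sentence is its own row, measures such as the share of speaking time held by the politician versus the interviewer reduce to simple group-wise aggregations, which is the kind of analysis of interviewer dynamics the abstract has in mind.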
