Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus
Benjamin Litterer, David Jurgens, Dallas Card
TL;DR
SPoRC is a large-scale, open, multimodal podcast corpus (1.1M English-language episodes from May–June 2020) with transcripts, audio features, and inferred speaker roles, enabling computational analyses of content, network structure, and responsiveness. The authors detail an end-to-end data pipeline (collection, transcription with Whisper, prosodic features with openSMILE, diarization via pyannote, and host/guest labeling with RoBERTa) and release the data in two formats. Using topic modeling and a guest-driven social network, they find coherent topical communities, network structure that tracks podcast categories, and rapid but uneven collective attention to events such as the murder of George Floyd, pointing to diffusion patterns beyond traditional news media. SPoRC thus provides a foundational resource for research into community identity, information diffusion, and incidental exposure in long-form audio media, with clear paths for expansion and replication.
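To make the idea of a guest-driven network concrete, here is a minimal sketch that links two podcasts whenever the same guest appears on both. It is an illustration only, not the authors' construction: the `(podcast_id, guest_name)` input format, the function name `build_guest_network`, and the choice to weight edges by the number of shared guests are all assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def build_guest_network(appearances):
    """Link podcasts that share at least one guest.

    appearances: iterable of (podcast_id, guest_name) pairs
                 (a hypothetical format, not the SPoRC schema).
    Returns a dict mapping a sorted (podcast_a, podcast_b) tuple
    to the number of guests the two podcasts have in common.
    """
    # Collect the set of podcasts each guest has appeared on.
    shows_by_guest = defaultdict(set)
    for podcast_id, guest in appearances:
        shows_by_guest[guest].add(podcast_id)

    # Every pair of podcasts sharing a guest gains one unit of edge weight.
    edge_weights = defaultdict(int)
    for shows in shows_by_guest.values():
        for a, b in combinations(sorted(shows), 2):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)


# Toy data with made-up identifiers.
appearances = [
    ("show_a", "Guest One"),
    ("show_b", "Guest One"),
    ("show_b", "Guest Two"),
    ("show_c", "Guest Two"),
    ("show_a", "Guest Two"),
]
network = build_guest_network(appearances)
```

In this toy input, `show_a` and `show_b` share two guests, so their edge carries weight 2, while the other two pairs carry weight 1. A real analysis would also need to resolve guest names to unique people before building edges, which this sketch does not attempt.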
Abstract
Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English-language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research on this popular and impactful medium.
