Mapping the Podcast Ecosystem with the Structured Podcast Research Corpus

Benjamin Litterer, David Jurgens, Dallas Card

TL;DR

SPoRC delivers a large-scale, open multimodal podcast corpus (1.1M English-language episodes from May–June 2020) with transcripts, audio features, and inferred speaker roles, enabling computational analyses of content, networks, and responsiveness. The authors detail an end-to-end data pipeline (collection, transcription with Whisper, prosody with openSMILE, diarization via pyannote, and host/guest labeling with RoBERTa) and release two dataset formats. Through topic modeling and a guest-driven social network, they uncover coherent topical communities, category-linked network structure, and rapid but varying collective attention to events such as the murder of George Floyd, highlighting diffusion patterns beyond traditional news media. SPoRC thus provides a foundational resource for research into community identity, information diffusion, and incidental exposure in long-form audio media, with clear paths for expansion and replication.

Abstract

Podcasts provide highly diverse content to a massive listener base through a unique on-demand modality. However, limited data has prevented large-scale computational analysis of the podcast ecosystem. To fill this gap, we introduce a massive dataset of over 1.1M podcast transcripts that is largely comprehensive of all English-language podcasts available through public RSS feeds from May and June of 2020. This data is not limited to text, but rather includes audio features and speaker turns for a subset of 370K episodes, and speaker role inferences and other metadata for all 1.1M episodes. Using this data, we also conduct a foundational investigation into the content, structure, and responsiveness of this ecosystem. Together, our data and analyses open the door to continued computational research of this popular and impactful medium.

Paper Structure

This paper contains 38 sections, 1 equation, 12 figures, and 2 tables.

Figures (12)

  • Figure 1: Many topics are strongly associated with a single category. However, a number of topics, such as "Black, Lives, Matter" and "Life, Success, Goals", cut across categories. Here, this is depicted using a sample of 25K episodes, colored by category, and projected using t-SNE on episodes' topic distributions to visualize topical distance, with select topic clusters annotated using the top words in the corresponding topic.
  • Figure 2: Business, Sports, and News have densely connected guest networks, with other categories being more diffuse. Edges in this network connect podcasts that share one or more common guests. Nodes represent podcasts, with color mapped to category, and node size indicating a podcast's total number of shared guests.
  • Figure 3: The murder of George Floyd triggers a fast and widespread discussion of racial justice in the podcast ecosystem. On the left, we plot a three-day rolling average of the topic percentages across all transcripts. On the right, we plot a three-day rolling average of the percentage of episodes in which the name "George Floyd" was said. Shaded bands represent 95% confidence intervals.
  • Figure 4: Estimated word error rates from Whisper, based on comparison to professional transcripts for six episodes from each of six shows. Vertical bars show averages across episodes. The low error rates for the performed monologues featured on Welcome to Night Vale hint at the fact that many of these apparent errors are actually due to the way most professional transcripts are edited to remove disfluencies in speech (see below).
  • Figure 5: Manually estimated word error rates from Whisper, based on listening to the first five content minutes of three episodes from each of six shows. Vertical bars show averages across episodes. The high error rate for one episode of This American Life is primarily due to Whisper's failure to transcribe the Spanish words in that episode.
  • ...and 7 more figures
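
The captions for Figures 4 and 5 rely on word error rate (WER) and note that edited professional transcripts, which drop disfluencies, can inflate the apparent error of a verbatim transcription. The sketch below illustrates that effect with the standard WER definition (word-level edit distance divided by reference length). It is only an illustration under stated assumptions: the function names, the `FILLERS` set, and the example sentences are hypothetical, and the text normalization the authors actually used is not described in this section.

```python
# Minimal WER sketch. All names and example strings here are illustrative
# assumptions, not the authors' evaluation code.

# Hypothetical list of filler words that an edited transcript might drop.
FILLERS = {"um", "uh", "er"}


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def strip_fillers(text: str) -> str:
    """Remove filler words so a verbatim transcript resembles an edited one."""
    return " ".join(w for w in text.lower().split() if w not in FILLERS)


# An edited "professional" transcript versus a verbatim automatic one.
edited_reference = "so i think that podcasts are great"
verbatim_hypothesis = "um so i think that uh podcasts are great"

raw_wer = word_error_rate(edited_reference, verbatim_hypothesis)
cleaned_wer = word_error_rate(edited_reference, strip_fillers(verbatim_hypothesis))
```

Here the two filler words count as insertions against a seven-word reference, so `raw_wer` is 2/7 even though every content word is correct; after stripping fillers, `cleaned_wer` is 0. This is the kind of gap that a manual, listening-based estimate such as the one in Figure 5 avoids.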