YTLive: A Dataset of Real-World YouTube Live Streaming Sessions

Mojtaba Mozhganfar, Pooya Jamshidi, Seyyed Ali Aghamiri, Mohsen Ghasemi, Mahdi Dolati, Farzad Tashtarian, Ahmad Khonsari, Christian Timmerer

TL;DR

YTLive addresses the need for public, large-scale live-streaming data by introducing a comprehensive YouTube Live dataset collected in May–June 2024. It provides over 507,000 records across 12,156 streams with 5-minute granularity for concurrency and precise start/end times, enabled by an automated pipeline using the YouTube Data API. The paper presents initial analyses showing weekend and afternoon peaks, with shorter streams attracting larger and more stable audiences, and discusses implications for adaptive streaming, QoE modeling, and resource provisioning. By making the dataset openly available, the work supports reproducible research and system-level innovation in live streaming.

Abstract

Live streaming plays a major role in today's digital platforms, supporting entertainment, education, and social media. However, research in this field is limited by the lack of large, publicly available datasets that capture real-time viewer behavior at scale. To address this gap, we introduce YTLive, a public dataset focused on YouTube Live. Collected through the YouTube Researcher Program over May and June 2024, YTLive includes more than 507,000 records from 12,156 live streams, tracking concurrent viewer counts at five-minute intervals along with precise broadcast durations. We describe the dataset design and collection process and present an initial analysis of temporal viewing patterns. Results show that viewer counts are higher and more stable on weekends, especially during afternoon hours. Shorter streams attract larger and more consistent audiences, while longer streams tend to grow slowly and exhibit greater variability. These insights have direct implications for adaptive streaming, resource allocation, and Quality of Experience (QoE) modeling. YTLive offers a timely, open resource to support reproducible research and system-level innovation in live streaming. The dataset is publicly available on GitHub.
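
To make the temporal analysis concrete, the sketch below groups five-minute concurrent-viewer samples by day of week and reports the mean and the coefficient of variation (standard deviation divided by mean) for each day. It is only an illustration: the record layout, the sample values, and the function name `viewership_by_weekday` are assumptions, not the published YTLive schema, and the paper may aggregate per stream rather than pooling all samples as done here.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean, pstdev

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Hypothetical records: (stream_id, ISO timestamp, concurrent viewers).
# This layout is an assumption for illustration, not the dataset's schema.
records = [
    ("a", "2024-05-04T14:00:00", 120),  # a Saturday
    ("a", "2024-05-04T14:05:00", 130),
    ("b", "2024-05-06T09:00:00", 40),   # a Monday
    ("b", "2024-05-06T09:05:00", 60),
]

def viewership_by_weekday(rows):
    """Return {weekday name: (mean viewers, coefficient of variation)}."""
    buckets = defaultdict(list)
    for _stream_id, timestamp, viewers in rows:
        day = WEEKDAYS[datetime.fromisoformat(timestamp).weekday()]
        buckets[day].append(viewers)

    stats = {}
    for day, values in buckets.items():
        mu = mean(values)
        # Population standard deviation over all pooled samples for the day.
        cv = pstdev(values) / mu if mu else float("nan")
        stats[day] = (mu, cv)
    return stats

stats = viewership_by_weekday(records)
for day, (mu, cv) in stats.items():
    print(f"{day}: mean={mu:.1f}, cv={cv:.3f}")
```

A lower coefficient of variation indicates a more stable audience, which is the sense in which the abstract calls weekend viewership "more stable". The same grouping could be keyed on the hour of day instead of the weekday.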

Paper Structure

This paper contains 8 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Daily and hourly box plots of mean and coefficient of variation of viewership: (a) Viewership by day of week, (b) viewership by hour of day, (c) coefficient of variation of viewership by day of week, and (d) coefficient of variation by hour of day.
  • Figure 2: Heatmaps of mean and coefficient of variation of viewership across the week.
  • Figure 3: Viewer metrics by category: boxplots of mean and variability of concurrent viewers across video lengths and day types.
  • Figure 4: CDF of video durations across categories
  • Figure 5: Effect of scheduling on the initial number of viewers
  • ...and 1 more figures