Forgetful by Design? A Critical Audit of YouTube's Search API for Academic Research

Bernhard Rieder, Adrian Padilla, Oscar Coromina

TL;DR

The paper asks whether YouTube's search API is reliable enough for scholarly work. It applies a maximalist, day-by-day sampling approach over six months across eleven queries to compare ranking modes (date vs. relevance) and to assess temporal and reproducibility problems. The findings show strong recency bias, rapid decay in retrievable content, and considerable non-replicability of search results, which together threaten the validity of longitudinal and historical analyses. The authors propose methodological workarounds and call for API improvements or separate researcher-focused endpoints to better support evidence-based research and compliance with the Digital Services Act.
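A day-by-day sampling approach of this kind can be sketched as a generator that emits one request per calendar day and ranking mode. This is a minimal illustration, not the authors' collection code: the helper name `daily_search_params` is hypothetical, the use of UTC day boundaries is an assumption, and the parameter names (`q`, `order`, `publishedAfter`, `publishedBefore`, `maxResults`) are those documented for the `search.list` method of the YouTube Data API v3. The sketch only builds parameter dictionaries and makes no network calls.

```python
from datetime import date, datetime, time, timedelta, timezone

# RFC 3339 timestamp format expected by publishedAfter / publishedBefore.
RFC3339 = "%Y-%m-%dT%H:%M:%SZ"


def daily_search_params(query, start, end, order="date"):
    """Yield one search.list parameter dict per calendar day in [start, end].

    Hypothetical helper: splitting the period into one-day windows is an
    assumption about how a day-by-day sampling design could be implemented.
    """
    day = start
    while day <= end:
        # Assumption: day boundaries are taken in UTC.
        window_start = datetime.combine(day, time.min, tzinfo=timezone.utc)
        window_end = window_start + timedelta(days=1)
        yield {
            "part": "snippet",
            "q": query,
            "type": "video",
            "order": order,  # "date" or "relevance"
            "maxResults": 50,  # documented maximum per result page
            "publishedAfter": window_start.strftime(RFC3339),
            "publishedBefore": window_end.strftime(RFC3339),
        }
        day += timedelta(days=1)


# Example: the April 20 to May 4, 2024 window, for both ranking modes.
requests_to_send = [
    params
    for order in ("date", "relevance")
    for params in daily_search_params(
        "Mukbang", date(2024, 4, 20), date(2024, 5, 4), order=order
    )
]
```

Each dictionary would then be passed to whatever HTTP or API client is in use, with pagination handled separately. Narrow windows keep each request's result set small, which matters because a single search returns only a limited number of results.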

Abstract

This paper critically audits the search endpoint of YouTube's Data API (v3), a common tool for academic research. Through systematic weekly searches over six months using eleven queries, we identify major limitations regarding completeness, representativeness, consistency, and bias. Our findings reveal substantial differences between ranking parameters like relevance and date in terms of video recall and precision, with relevance often retrieving numerous off-topic videos. We also observe severe temporal decay in video discoverability: the number of retrievable videos for a given period drops dramatically within just 20-60 days of publication, even though these videos remain on the platform. This potentially undermines research designs that rely on systematic data collection. Furthermore, search results lack consistency, with identical queries yielding different video sets over time, compromising replicability. A case study on the European Parliament elections highlights how these issues impact research outcomes. While the paper offers several mitigation strategies, it concludes that the API's search function, potentially prioritizing 'freshness' over comprehensive retrieval, is not adequate for robust academic research, especially concerning Digital Services Act requirements.
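The two problems described above, temporal decay and inconsistent result sets, can be quantified by comparing the sets of video IDs that repeated runs of the same query return for the same publication period. The sketch below is one possible way to do this and is not taken from the paper: the function names `jaccard` and `retention` are hypothetical, and the video IDs are made-up placeholders rather than real data.

```python
def jaccard(run_a, run_b):
    """Overlap of two result sets: |A ∩ B| / |A ∪ B| (1.0 means identical)."""
    a, b = set(run_a), set(run_b)
    if not a and not b:
        return 1.0  # convention: two empty runs are treated as identical
    return len(a & b) / len(a | b)


def retention(earlier_run, later_run):
    """Share of videos from the earlier run that the later run still returns."""
    earlier = set(earlier_run)
    if not earlier:
        return 1.0  # convention: nothing to lose from an empty run
    return len(earlier & set(later_run)) / len(earlier)


# Placeholder IDs for the same query and publication period, searched twice.
first_search = ["vid_a", "vid_b", "vid_c", "vid_d"]
later_search = ["vid_c", "vid_d", "vid_e"]

overlap = jaccard(first_search, later_search)    # 2 shared / 5 distinct = 0.4
kept = retention(first_search, later_search)     # 2 of 4 still found = 0.5
```

A low retention value for an unchanged publication period would indicate decay in discoverability, while a low Jaccard value between runs close together in time would indicate inconsistency. Neither measure says whether the missing videos were deleted from the platform, so a separate availability check would be needed to distinguish the two cases.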

Paper Structure

This paper contains 13 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Top two: Videos per day, published from October 15, 2023 onwards, for the queries 'Ukraine war' (top) and 'Mukbang' (middle), searched on April 23, 2024 and November 4, 2024.
  • Figure 2: Number of videos published between April 20, 2024 and May 4, 2024 for all queries, searched on six separate dates with both relevance and date ranking.
  • Figure 3: New and old videos published between May 7 and May 13, 2024 for searches over three consecutive weeks.
  • Figure 4: Videos per day for five searches over the same observation period, with keyword filtering.