YT-30M: A multi-lingual multi-category dataset of YouTube comments

Hridoy Sankar Dutta

TL;DR

This work addresses the need for large-scale multilingual, multi-category data from YouTube comments to support cross-linguistic sentiment and societal analyses. It introduces YT-30M and its 100K sample (YT-100K), with multimodal metadata and category labels derived from channel information, and ensures PII redaction for privacy. Through analysis on the 100K subset, the paper reveals language and category distributions, engagement patterns via upvotes, sentiment diversity, and comment-length trends across categories, demonstrating the dataset's utility for sociolinguistic and NLP research. The public release on Hugging Face (YT-100K) and the option to access YT-30M enable broad adoption and benchmarking for multilingual and category-aware comment modeling, with potential impact on sentiment analysis, discourse understanding, and digital sociology studies.

Abstract

This paper introduces two large-scale multilingual comment datasets from YouTube, YT-30M and YT-100K. The analysis in this paper is performed on YT-100K, a smaller sample of YT-30M. Both datasets, YT-30M (full) and YT-100K (a randomly selected 100K sample from YT-30M), are publicly released for further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted by YouTube channels that belong to YouTube categories. Each comment is associated with a video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and the category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology').
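The per-comment fields listed in the abstract can be pictured as a simple record. The Python sketch below is illustrative only: the class name, the field names, and the two helper functions are assumptions made for this example and may not match the column names or tooling of the released files.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Comment:
    """One comment record; field names are hypothetical, not the dataset's schema."""
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    comment_text: str
    upvotes: int
    original_channel_id: str
    category: str  # e.g. 'News & Politics', 'Science & Technology'


def comments_per_category(comments: Iterable[Comment]) -> Counter:
    """Count how many comments fall under each YouTube category."""
    return Counter(c.category for c in comments)


def mean_upvotes_per_category(comments: Iterable[Comment]) -> Dict[str, float]:
    """Average the upvote counts of the comments in each category."""
    grouped = defaultdict(list)
    for c in comments:
        grouped[c.category].append(c.upvotes)
    return {cat: sum(votes) / len(votes) for cat, votes in grouped.items()}
```

Grouping by `category` in this way mirrors the kind of per-category summaries the paper reports (category proportions and upvote distributions); any values passed in for experimentation would be placeholders, not data from YT-30M.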

Paper Structure

This paper contains 4 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A YouTube comment from YT-30M
  • Figure 2: The main plot shows the proportion of languages detected in YouTube comments. The inset plot shows the proportion of YouTube categories.
  • Figure 3: Upvotes distribution for YouTube categories.
  • Figure 4: Analysis of our collected dataset.