YT-30M: A multi-lingual multi-category dataset of YouTube comments
Hridoy Sankar Dutta
TL;DR
This work addresses the need for large-scale multilingual, multi-category data from YouTube comments to support cross-linguistic sentiment and societal analyses. It introduces YT-30M and its 100K sample (YT-100K), with multimodal metadata and category labels derived from channel information, and ensures PII redaction for privacy. Through analysis on the 100K subset, the paper reveals language and category distributions, engagement patterns via upvotes, sentiment diversity, and comment-length trends across categories, demonstrating the dataset's utility for sociolinguistic and NLP research. The public release on Hugging Face (YT-100K) and the option to access YT-30M enable broad adoption and benchmarking for multilingual and category-aware comment modeling, with potential impact on sentiment analysis, discourse understanding, and digital sociology studies.
Abstract
This paper introduces two large-scale multilingual comment datasets from YouTube: YT-30M and YT-100K. YT-30M is the full dataset, and YT-100K is a randomly selected sample of roughly 100K comments drawn from it. The analysis in this paper is performed on YT-100K. Both datasets are publicly released for further research. YT-30M (YT-100K) contains 32,236,173 (108,694) comments posted on YouTube channels belonging to various YouTube categories. Each comment is associated with a video ID, comment ID, commenter name, commenter channel ID, comment text, upvotes, original channel ID, and the category of the YouTube channel (e.g., 'News & Politics', 'Science & Technology').
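To make the per-comment record concrete, the fields listed in the abstract can be sketched as a small Python data structure. This is only an illustrative sketch: the attribute names (`video_id`, `commenter_channel_id`, and so on), the `Comment` class, and the `comments_per_category` helper are assumptions made here for readability, and the released files may use different column names and types. The sample records are invented and do not come from the dataset.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Comment:
    # Hypothetical field names mirroring the attributes named in the abstract;
    # the actual column names in the released data may differ.
    video_id: str
    comment_id: str
    commenter_name: str
    commenter_channel_id: str
    text: str
    upvotes: int
    original_channel_id: str
    category: str


def comments_per_category(comments):
    """Count how many comments fall under each channel category."""
    return Counter(c.category for c in comments)


# Invented records for illustration only; not taken from YT-30M or YT-100K.
sample = [
    Comment("vid1", "c1", "user_a", "ch_a", "Great explanation!", 12,
            "ch_x", "Science & Technology"),
    Comment("vid1", "c2", "user_b", "ch_b", "Very helpful.", 3,
            "ch_x", "Science & Technology"),
    Comment("vid2", "c3", "user_c", "ch_c", "Thanks for the update.", 0,
            "ch_y", "News & Politics"),
]

counts = comments_per_category(sample)
total_upvotes = sum(c.upvotes for c in sample)
```

Grouping by `category` in this way is the kind of aggregation behind a category-distribution analysis; the same pattern extends to upvotes or comment length per category.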
