A Large-scale Dataset with Behavior, Attributes, and Content of Mobile Short-video Platform

Yu Shang, Chen Gao, Nian Li, Yong Li

TL;DR

This work addresses the lack of publicly accessible, multimodal short-video datasets by introducing a large-scale dataset collected from a real mobile platform, covering 10,000 volunteers, 1,019,568 interactions, and 153,561 videos with rich behavior, attribute, and content data. It provides preprocessing features for video content, bilingual ASR, and extensive attribute coverage while ensuring privacy and consent. The authors validate the dataset by assessing data richness and content quality, by eight-recipe benchmarking across multimodal recommender systems, and by a study of filter bubbles, illustrating its practical utility for user modeling, social science, and AI research. The data and code are released to support both academic and industrial exploration, with plans to add finer-grained content and longer interaction histories in the future.

Abstract

Short-video platforms have a growing impact on people's daily lives, with billions of active users spending substantial time on them each day. The interactions between users and online platforms give rise to many scientific problems across computational social science and artificial intelligence. However, despite the rapid development of short-video platforms, existing datasets have serious shortcomings in three respects: inadequate user-video feedback, limited user attributes, and a lack of video content. To address these problems, we provide a large-scale dataset with rich user behavior, attributes, and video content from a real mobile short-video platform. The dataset covers 10,000 voluntary users and 153,561 videos, and we conduct a four-fold technical validation of it. First, we verify the richness of the behavior and attribute data. Second, we confirm the representational ability of the content features. Third, we provide benchmarking results for recommendation algorithms on our dataset. Finally, we explore the filter bubble phenomenon on the platform using the dataset. We believe the dataset can support the broad research community, including but not limited to user modeling, social science, and human behavior understanding. The dataset and code are available at https://github.com/tsinghua-fib-lab/ShortVideo_dataset.

Paper Structure

This paper contains 14 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: The illustration of user interface and behaviors on the platform (a) and an overview of the dataset (b).
  • Figure 2: Interaction number distribution of (a) users and (b) videos.
  • Figure 3: Distribution of some key fields in user attributes.
  • Figure 4: Embedding visualization of videos with different (a) Category I and (b) Category III through t-SNE.
  • Figure 5: Analysis of the filter bubble ratio of active users (a) and inactive users (b) over time in our dataset.
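
As an illustration of the kind of statistic summarized in Figure 2, the following minimal sketch counts interactions per user and per video and then builds a histogram of those counts. The record layout (pairs of `user_id` and `video_id`), the function names, and the toy records are assumptions made for illustration; they are not the dataset's documented schema or the authors' code.

```python
from collections import Counter


def interaction_counts(interactions):
    """Count interactions per user and per video.

    `interactions` is an iterable of (user_id, video_id) pairs. This
    record layout is an assumption, not the dataset's documented schema.
    """
    per_user = Counter()
    per_video = Counter()
    for user_id, video_id in interactions:
        per_user[user_id] += 1
        per_video[video_id] += 1
    return per_user, per_video


def count_histogram(counts):
    """Map each interaction count to the number of users (or videos) having it."""
    return dict(sorted(Counter(counts.values()).items()))


# Toy records for illustration only (not taken from the dataset).
toy = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u3", "v1")]
per_user, per_video = interaction_counts(toy)
user_hist = count_histogram(per_user)    # {1: 2, 2: 1}
video_hist = count_histogram(per_video)  # {1: 1, 3: 1}
```

On real interaction logs, the two histograms could be plotted on logarithmic axes to inspect how skewed user and video activity is.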