A Content-Driven Micro-Video Recommendation Dataset at Scale

Yongxin Ni, Yu Cheng, Xiangyan Liu, Junchen Fu, Youhua Li, Xiangnan He, Yongfeng Zhang, Fajie Yuan

TL;DR

MicroLens delivers the largest public multimodal micro-video dataset to date, pairing 1B user-item interactions with rich raw video modalities to support content-driven recommender research. Through extensive benchmarking of ID-based, VIDRec, and end-to-end VideoRec models, the work demonstrates that end-to-end learning from raw video content yields the strongest recommendation performance, while pre-extracted features offer limited or inconsistent gains in non-cold settings. The paper also provides a thorough analysis of how video understanding knowledge transfers to recommendation, revealing that retrained video encoders outperform frozen ones and that semantic representations from video classification are not universally transferable to recommendation tasks. Overall, MicroLens is positioned as a foundation for multimodal RS development and a potential pre-training resource for future universal recommender systems.

Abstract

Micro-videos have recently gained immense popularity, sparking critical research in micro-video recommendation with significant implications for the entertainment, advertising, and e-commerce industries. However, the lack of large-scale public micro-video datasets poses a major challenge for developing effective recommender systems. To address this challenge, we introduce a very large micro-video recommendation dataset, named "MicroLens", consisting of one billion user-item interaction behaviors, 34 million users, and one million micro-videos. This dataset also contains various raw modality information about videos, including titles, cover images, audio, and full-length videos. MicroLens serves as a benchmark for content-driven micro-video recommendation, enabling researchers to utilize various modalities of video information for recommendation, rather than relying solely on item IDs or off-the-shelf video features extracted from a pre-trained network. Our benchmarking of multiple recommender models and video encoders on MicroLens has yielded valuable insights into the performance of micro-video recommendation. We believe that this dataset will not only benefit the recommender system community but also promote the development of the video understanding field. Our datasets and code are available at https://github.com/westlake-repl/MicroLens.

Paper Structure (25 sections, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Dataset construction pipeline.
  • Figure 2: Item examples in MicroLens.
  • Figure 3: Statistics of MicroLens-100K.
  • Figure 4: Video recommendation accuracy (bar charts) vs. video classification accuracy (purple line). "Frozen" means the video encoder is fixed with no parameter updates; "topT" means only the top few layers of the video encoder are fine-tuned; "FT" means all parameters are fine-tuned.
  • Figure 5: Ablation study of video encoders. (a) "WT" means the video encoders in SASRec$_{\rm V}$ are initialized with pre-trained weights from the video classification task, while "OT" means they are randomly initialized. (b), (c), and (d) show the performance change from adding DNN layers on top of three frozen encoders.
  • ...and 2 more figures