Multi-Modal Video Feature Extraction for Popularity Prediction

Haixu Liu, Wenning Wang, Haoxiang Zheng, Penghao Jiang, Qirui Wang, Ruiqing Yan, Qiuzhuang Sun

TL;DR

This work predicts the popularity of short videos from multiple modalities. The authors combine four video feature vectors (from TimeSformer, ViViT, VideoMAE, and X-CLIP) with two text vectors, obtained by using BERT to encode the descriptions that LLaVA-NeXT and InternVideo2 generate from caption-based prompts, and complement them with engineered tabular features. Four metric-specific neural networks are trained alongside an XGBoost model, and their predictions are averaged to give the final estimates for four engagement metrics, evaluated by mean absolute percentage error (MAPE). Key findings are that neural networks and tree ensembles behave in complementary ways, that hashtag, mention, and time-based features are effective, and that X-CLIP yields the strongest video features; the combined system placed first on the leaderboard. The approach offers a practical framework for multimodal popularity prediction, with potential applications for content creators and recommendation systems.
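
The evaluation metric and the averaging step described above can be illustrated with a minimal sketch. The function names (`mape`, `average_predictions`) and the toy numbers are assumptions made for illustration only; they are not taken from the paper, and the paper's exact MAPE variant (for example, how zero counts are handled) is not stated here.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.1 means 10%).

    Assumes every true value is non-zero; the source does not say how
    zero engagement counts are treated.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_predictions(nn_pred, xgb_pred):
    """Unweighted element-wise mean of two models' predictions."""
    return [(a + b) / 2 for a, b in zip(nn_pred, xgb_pred)]


# Toy values (hypothetical, not results from the paper).
y_true = [100, 200, 400]
nn_pred = [120, 180, 400]    # stand-in for the neural network's output
xgb_pred = [80, 220, 440]    # stand-in for the XGBoost output
blended = average_predictions(nn_pred, xgb_pred)

print(mape(y_true, nn_pred), mape(y_true, xgb_pred), mape(y_true, blended))
```

In this toy case the two models err in opposite directions on the first two videos, so the average has a lower MAPE than either model alone. That is one way complementary models can benefit from averaging, though it is not guaranteed in general.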

Abstract

This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. The study uses video classification models with different architectures and training methods as backbone networks to extract video modality features. In parallel, the cleaned video captions are inserted into a carefully designed prompt framework and passed, together with the video, to video-to-text generation models, which produce detailed textual descriptions of the video content. These texts are then encoded into vectors using a pre-trained BERT model. On the resulting six sets of vectors (four from the video backbones and two from the generated texts), a neural network is trained for each of the four prediction metrics. The study also performs data mining and feature engineering on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained on these features, and the most stable one, XGBoost, is selected. Finally, the predictions of the neural network and XGBoost models are averaged to obtain the final result.
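
As an illustration of the hashtag and mention features mentioned above, here is a minimal sketch under one possible reading of "total frequency": for each caption, sum the dataset-wide occurrence counts of the hashtags (or mentions) it contains. This interpretation, the regular expressions, and the names `tag_frequency_features`, `hashtag_total_freq`, and `mention_total_freq` are assumptions, not the authors' implementation.

```python
import re
from collections import Counter

# Assumed patterns: a tag is '#' or '@' followed by word characters.
HASHTAG_RE = re.compile(r"#(\w+)")
MENTION_RE = re.compile(r"@(\w+)")


def tag_frequency_features(captions):
    """Return one feature dict per caption.

    Each feature is the sum, over the caption's tags, of how often
    that tag occurs across all captions (case-insensitive).
    """
    hashtags = [[t.lower() for t in HASHTAG_RE.findall(c)] for c in captions]
    mentions = [[m.lower() for m in MENTION_RE.findall(c)] for c in captions]

    # Dataset-wide occurrence counts.
    hashtag_counts = Counter(t for tags in hashtags for t in tags)
    mention_counts = Counter(m for ms in mentions for m in ms)

    return [
        {
            "hashtag_total_freq": sum(hashtag_counts[t] for t in tags),
            "mention_total_freq": sum(mention_counts[m] for m in ms),
        }
        for tags, ms in zip(hashtags, mentions)
    ]


# Hypothetical captions for demonstration.
captions = [
    "Morning run #fitness #fyp",
    "New recipe #fyp @chef_anna",
    "No tags here",
]
features = tag_frequency_features(captions)
print(features)
```

Under this reading, a caption that uses widely shared hashtags receives a larger feature value than one with rare or no hashtags. In practice the counts would presumably be computed on the training set and then looked up for test captions.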

Paper Structure

This paper contains 8 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The workflow of our model.
  • Figure 2: Comparison of the output distributions of the neural networks and XGBoost on the training and test sets.
  • Figure 3: Feature importance for comment prediction.
  • Figure 4: Feature importance for heart prediction.
  • Figure 5: Feature importance for play prediction.
  • ...and 1 more figure