Multi-Modal Video Feature Extraction for Popularity Prediction
Haixu Liu, Wenning Wang, Haoxiang Zheng, Penghao Jiang, Qirui Wang, Ruiqing Yan, Qiuzhuang Sun
TL;DR
This work predicts short-video popularity from several modalities: video features from multiple backbones and text representations generated from caption-driven prompts. The authors combine four video-feature vectors (TimeSformer, ViViT, VideoMAE, X-CLIP) with two text vectors, obtained by using BERT to encode the descriptions generated by LLaVA-NeXT and InternVideo2, and fuse these with engineered tabular features. Four metric-specific neural networks are trained alongside an XGBoost model, and their predictions are averaged to yield final estimates for four engagement metrics, evaluated by mean absolute percentage error (MAPE). Key findings are that neural networks and tree ensembles complement each other, that hashtag/mention and time-based features are effective, and that X-CLIP yields the strongest video features; the combined approach ranked first on the leaderboard. The approach offers a practical framework for multimodal popularity prediction, with potential applications for content creators and recommendation systems.
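The averaging step and the MAPE metric mentioned above can be illustrated with a minimal sketch. The function names `mape` and `average_predictions` and all numbers are hypothetical and chosen only for illustration; they are not taken from the authors' code or data. The sketch assumes an unweighted mean of the two models' outputs and that no true value is zero, since MAPE is undefined in that case.

```python
def mape(y_true, y_pred):
    """Mean absolute percentage error, returned as a fraction (0.05 means 5%)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # Assumes every true value is non-zero; MAPE is undefined otherwise.
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_predictions(nn_pred, xgb_pred):
    """Unweighted mean of two models' predictions, element by element."""
    return [(a + b) / 2 for a, b in zip(nn_pred, xgb_pred)]


# Made-up values for a single metric (for example, view count).
y_true = [100.0, 200.0, 400.0]
nn_pred = [90.0, 220.0, 380.0]    # hypothetical neural-network outputs
xgb_pred = [120.0, 190.0, 440.0]  # hypothetical XGBoost outputs

blended = average_predictions(nn_pred, xgb_pred)

print("NN MAPE:     ", mape(y_true, nn_pred))
print("XGBoost MAPE:", mape(y_true, xgb_pred))
print("Blended MAPE:", mape(y_true, blended))
```

In this toy example the two models err in opposite directions on each sample, so the blended prediction has a lower MAPE than either model alone. That is the kind of complementarity the summary describes, although real gains depend on how correlated the models' errors are.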
Abstract
This work aims to predict the popularity of short videos using the videos themselves and their related features. Popularity is measured by four key engagement metrics: view count, like count, comment count, and share count. This study employs video classification models with different architectures and training methods as backbone networks to extract video modality features. In parallel, the cleaned video captions are inserted into a carefully designed prompt framework and passed, together with the video, to video-to-text generation models, which produce detailed textual descriptions of the video content. These texts are then encoded into vectors using a pre-trained BERT model. Using the resulting six sets of vectors, a separate neural network is trained for each of the four prediction metrics. In addition, the study conducts data mining and feature engineering on the video and tabular data, constructing practical features such as the total frequency of hashtag appearances, the total frequency of mention appearances, video duration, frame count, frame rate, and total time online. Multiple machine learning models are trained on these features, and the most stable one, XGBoost, is selected. Finally, the predictions from the neural network and XGBoost models are averaged to obtain the final result.
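The abstract does not define how the "total frequency" of hashtag and mention appearances is computed. The sketch below shows one plausible reading, offered as an assumption rather than the authors' method: count how often each hashtag (or mention) occurs across all captions, then sum those counts over the tags in a given caption. The regular expressions, function names, and sample captions are all hypothetical.

```python
import re
from collections import Counter

# Simple patterns for illustration; real captions may need more careful parsing.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def tag_counts(captions, pattern):
    """Count how often each tag matched by `pattern` occurs across all captions."""
    counts = Counter()
    for caption in captions:
        counts.update(tag.lower() for tag in pattern.findall(caption))
    return counts


def total_frequency(caption, pattern, counts):
    """Sum the corpus-wide counts of every tag that appears in one caption."""
    return sum(counts[tag.lower()] for tag in pattern.findall(caption))


# Made-up captions, not taken from the dataset.
captions = [
    "Morning run #fitness #fyp",
    "New recipe #food #fyp @chef_anna",
    "Thanks @chef_anna!",
]

hashtag_counts = tag_counts(captions, HASHTAG)
mention_counts = tag_counts(captions, MENTION)

# One (hashtag_total, mention_total) pair per caption.
features = [
    (
        total_frequency(c, HASHTAG, hashtag_counts),
        total_frequency(c, MENTION, mention_counts),
    )
    for c in captions
]
print(features)
```

Under this reading, a caption that uses widely shared hashtags receives a larger feature value than one that uses rare ones. In a real pipeline the counts would presumably be computed on the training split only, to avoid leaking test-set information.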
