Short Video Segment-level User Dynamic Interests Modeling in Personalized Recommendation

Zhiyu He, Zhixin Ling, Jiayu Li, Zhiqiang Guo, Weizhi Ma, Xinchen Luo, Min Zhang, Guorui Zhou

TL;DR

This work addresses a limitation of conventional video-level recommender systems by modeling dynamic user interests at the segment level within short videos. It introduces a three-component model: a Hybrid User & Video Representation, a Multi-modal User-video Encoder with User–Video Cross-attention, and a Segment Interest Decoder that produces per-segment scores $\vec{p}$. Training uses an intra-video loss and a position bias to capture temporal attention patterns. The approach is evaluated on two downstream tasks (video-skip prediction and short video recommendation) on the SegMM and KuaiRand datasets, where it reports substantial improvements over diverse baselines. The authors also release SegMM, a new dataset that includes segment-level data. Case studies illustrate practical uses such as personalized thumbnails and targeted editing. Overall, the study indicates that segment-level modeling provides richer signals about user engagement and can improve both predictive accuracy and user satisfaction in short-video recommendation.
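To make the idea of per-segment scores $\vec{p}$ concrete, here is a minimal sketch in plain Python. It is not the authors' architecture: the function name `segment_interest_scores`, the toy embeddings, and the choice to add the position bias directly to a dot-product logit are all assumptions made for illustration. Each segment attends over the user's history with scaled dot-product cross-attention, and the resulting logit is squashed to a score in (0, 1).

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def segment_interest_scores(segments, user_history, position_bias):
    """Return one score in (0, 1) per segment (hypothetical helper).

    segments:      N segment embeddings, used as attention queries
    user_history:  M embeddings of past items, used as keys and values
    position_bias: N scalars, one per segment position (assumed additive)
    """
    dim = len(segments[0])
    scores = []
    for seg, bias in zip(segments, position_bias):
        # Cross-attention: the segment attends over the user's history.
        weights = softmax([dot(seg, h) / math.sqrt(dim) for h in user_history])
        context = [
            sum(w * h[d] for w, h in zip(weights, user_history))
            for d in range(dim)
        ]
        # Logit = agreement between segment and user context, plus position bias.
        logit = dot(seg, context) + bias
        scores.append(1.0 / (1.0 + math.exp(-logit)))
    return scores


# Toy data: three segments, two history items, 2-dimensional embeddings.
segments = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
user_history = [[1.0, 0.0], [0.8, 0.2]]
position_bias = [0.3, 0.0, -0.3]
p = segment_interest_scores(segments, user_history, position_bias)
```

In this toy setup the first segment aligns with the user's history and the last one opposes it, so the scores decrease along the video. A learned model would replace the hand-written embeddings and bias with trained parameters.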

Abstract

The rapid growth of short videos has necessitated effective recommender systems to match users with content tailored to their evolving preferences. Current video recommendation models primarily treat each video as a whole, overlooking the dynamic nature of user preferences for specific video segments. In contrast, our research focuses on segment-level user interest modeling, which is crucial for understanding how users' preferences evolve during video browsing. To capture users' dynamic segment interests, we propose an innovative model that integrates a hybrid representation module, a multi-modal user-video encoder, and a segment interest decoder. Our model addresses the challenges of capturing dynamic interest patterns, missing segment-level labels, and fusing different modalities, achieving precise segment-level interest prediction. We present two downstream tasks to evaluate the effectiveness of our segment interest modeling approach: video-skip prediction and short video recommendation. Our experiments on real-world short video datasets with diverse modalities show promising results on both tasks. These results demonstrate that segment-level interest modeling brings a deeper understanding of user engagement and enhances video recommendations. We also release a unique dataset that includes segment-level video data and diverse user behaviors, enabling further research in segment-level interest modeling. This work pioneers a novel perspective on understanding users' segment-level preferences, offering the potential for more personalized and engaging short video experiences.

Paper Structure

This paper contains 30 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Users' dynamic interests in video segments reflect their diverse preferences, offering deeper insight than the overall video preference. Such interests benefit downstream applications such as video-skip prediction, recommendations, and personalized homepage thumbnails.
  • Figure 2: Overview of user segment interest modeling with hybrid user and video representation, multi-modal user-video encoder, and segment interest decoder. $N$ is the number of segments in the target video, and segment interest scores are the model's output.
  • Figure 3: Details of the interest detection module in the multi-modal encoder. $U$ and $V$ denote the representations of the user and target video.
  • Figure 4: Ablation study of attention mechanism, segment position indices, and replacing our loss with BCE (Binary Cross-Entropy) loss.
  • Figure 5: SegRec: Segment-integrated Video Recommendation Framework. The parameters of the segment interest model are frozen, and it supplies segment interest scores $\vec{p}=(p_1,p_2,\dots,p_N)$ to the downstream video recommendation model.
  • ...and 1 more figure
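The Figure 5 caption describes frozen segment interest scores $\vec{p}$ being passed to a downstream recommender. The sketch below illustrates that hand-off under strong assumptions: the helper names `pool_segment_scores` and `rerank`, the pooled features, and the fixed blending weight are hypothetical, and the actual SegRec framework presumably learns how to integrate the scores rather than using a hand-set blend.

```python
def pool_segment_scores(p):
    """Summarize per-segment scores p = (p_1, ..., p_N) into fixed-size features."""
    return {"mean": sum(p) / len(p), "max": max(p), "first": p[0]}


def rerank(candidates, base_scores, segment_scores, weight=0.5):
    """Blend a base recommender score with pooled, precomputed segment interest.

    segment_scores holds the outputs of the frozen segment interest model;
    nothing in this function updates that model.
    """
    blended = {}
    for vid in candidates:
        feats = pool_segment_scores(segment_scores[vid])
        blended[vid] = (1 - weight) * base_scores[vid] + weight * feats["mean"]
    return sorted(candidates, key=lambda v: blended[v], reverse=True)


# Toy data: video "a" has the higher base score, "b" the higher segment interest.
base_scores = {"a": 0.60, "b": 0.55}
segment_scores = {"a": [0.2, 0.3, 0.1], "b": [0.9, 0.8, 0.7]}
ranking = rerank(["a", "b"], base_scores, segment_scores)
```

Here video "b" overtakes "a" once segment interest is blended in, which is the kind of signal a video-level score alone would miss.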