Dreaming User Multimodal Representation Guided by The Platonic Representation Hypothesis for Micro-Video Recommendation

Chengzhi Lin, Hezheng Lin, Shuchang Liu, Cangguang Ruan, LingJing Xu, Dezhao Yang, Chuyuan Wang, Yongqi Liu

TL;DR

DreamUMM builds real-time user representations in a multimodal space from historical behaviors, and its online deployment provides empirical evidence that user interest representations can reside in such a space.

Abstract

The proliferation of online micro-video platforms has underscored the necessity for advanced recommender systems to mitigate information overload and deliver tailored content. Despite advancements, accurately and promptly capturing dynamic user interests remains a formidable challenge. Inspired by the Platonic Representation Hypothesis, which posits that different data modalities converge towards a shared statistical model of reality, we introduce DreamUMM (Dreaming User Multi-Modal Representation), a novel approach leveraging user historical behaviors to create real-time user representations in a multimodal space. DreamUMM employs a closed-form solution correlating user video preferences with multimodal similarity, hypothesizing that user interests can be effectively represented in a unified multimodal space. Additionally, we propose Candidate-DreamUMM for scenarios lacking recent user behavior data, inferring interests from candidate videos alone. Extensive online A/B tests demonstrate significant improvements in user engagement metrics, including active days and play count. The successful deployment of DreamUMM on two micro-video platforms with hundreds of millions of daily active users illustrates its practical efficacy and scalability in personalized micro-video content delivery. Our work contributes to the ongoing exploration of representational convergence by providing empirical evidence supporting the potential for user interest representations to reside in a multimodal space.
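
The abstract does not spell out the closed-form solution. As an illustration only, the sketch below assumes one plausible reading: the user vector u maximizes the preference-weighted sum of dot products with the embeddings of watched videos, subject to u having unit norm. Under that assumption the maximizer is the normalized, preference-weighted sum of the video embeddings. The function name, the weights, and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math


def dream_user_representation(video_embeddings, preference_weights):
    """Return argmax_u sum_i w_i * <u, v_i> subject to ||u|| = 1.

    Assumed objective (not confirmed by the abstract). Its closed-form
    solution is the L2-normalized weighted sum of the video embeddings.
    """
    if not video_embeddings:
        raise ValueError("need at least one video embedding")
    if len(video_embeddings) != len(preference_weights):
        raise ValueError("one preference weight is required per video")

    dim = len(video_embeddings[0])
    weighted_sum = [0.0] * dim
    for embedding, weight in zip(video_embeddings, preference_weights):
        for k in range(dim):
            weighted_sum[k] += weight * embedding[k]

    norm = math.sqrt(sum(x * x for x in weighted_sum))
    if norm == 0.0:
        raise ValueError("weighted sum is the zero vector; u is undefined")
    return [x / norm for x in weighted_sum]


# Toy example: two videos in a 2-D multimodal space, with hypothetical
# preference weights (for instance, derived from watch time).
user_vector = dream_user_representation([[1.0, 0.0], [0.0, 1.0]], [3.0, 4.0])
```

Because the result is a single normalized vector, it can be recomputed from a user's most recent interactions at request time without any training step, which is consistent with the real-time use described in the abstract.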

Paper Structure

This paper contains 29 sections, 10 equations, 4 figures, 2 tables, 1 algorithm.

Figures (4)

  • Figure 1: We hypothesize that user interests can be represented in a multimodal space, into which different data modalities (e.g., images and text) are projected.
  • Figure 2: DreamUMM constructs the user's multimodal representation based on the user's liking for micro videos.
  • Figure 3: Multimodal representation learning framework.
  • Figure 4: Diversity results of DreamUMM and Candidate-DreamUMM on two micro-video platforms. The bar chart illustrates the relative improvements in the Exposed Cluster Count and Surprise Cluster metrics over the online method. Candidate-DreamUMM consistently outperforms DreamUMM across both platforms and metrics, with the most significant gains observed in the Surprise Cluster metric on Platform B. These results demonstrate the effectiveness of Candidate-DreamUMM in enhancing recommendation diversity and novelty by leveraging contextual information to capture users' real-time preferences.