M3TR: Temporal Retrieval Enhanced Multi-Modal Micro-video Popularity Prediction

Jiacheng Lu, Weijian Wang, Mingyuan Xiao, Yang Hua, Tao Song, Jiaru Zhang, Bo Peng, Cheng Hua, Haibing Guan

TL;DR

A framework that combines fine-grained temporal modeling of user feedback with a novel temporal-aware retrieval process for micro-video popularity prediction. It achieves state-of-the-art performance and significant gains on long-term prediction.

Abstract

Accurately predicting the popularity of micro-videos is a critical but challenging task, characterized by volatile, "rollercoaster-like" engagement dynamics. Existing methods often fail to capture these complex temporal patterns, leading to inaccurate long-term forecasts. This failure stems from two fundamental limitations: (1) a superficial understanding of user feedback dynamics, which overlooks the mutually exciting and decaying nature of interactions such as likes, comments, and shares; and (2) retrieval mechanisms that rely solely on static content similarity, ignoring how a video's popularity evolves over time. To address these limitations, we propose M$^3$TR, a Temporal Retrieval enhanced Multi-Modal framework that combines fine-grained temporal modeling with a novel temporal-aware retrieval process for Micro-video popularity prediction. At its core, M$^3$TR introduces a Mamba-Hawkes Process (MHP) module that explicitly models user feedback as a sequence of self-exciting events, capturing the intricate, long-range dependencies within user interactions (addressing limitation 1). This rich temporal representation then powers a temporal-aware retrieval engine that identifies historically relevant videos based on a combined similarity of both their multi-modal content (visual, audio, text) and their popularity trajectories (addressing limitation 2). By augmenting the target video's features with this retrieved knowledge, M$^3$TR bases its prediction on both the video's content and its engagement dynamics. Extensive experiments on two real-world datasets demonstrate the superiority of our framework. M$^3$TR achieves state-of-the-art performance, outperforming previous methods by up to 19.3% in nMSE and showing significant gains in addressing long-term prediction challenges.
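To make the "mutually exciting and decaying" behaviour concrete, here is a minimal sketch of a classical multivariate Hawkes intensity with an exponential kernel. It is not the paper's Mamba-Hawkes Process: the exponential kernel, the shared decay rate `beta`, the function name `hawkes_intensity`, and every numeric value are assumptions chosen for illustration.

```python
import math

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity of each event type at time t (classical Hawkes).

    events: list of (time, type_index) pairs, e.g. 0=like, 1=comment, 2=share.
    mu[i]: baseline rate of type i.
    alpha[i][j]: how strongly a past event of type j excites type i.
    beta: exponential decay rate (assumed shared by all type pairs).
    """
    lam = list(mu)  # start from the baseline rates
    for t_k, j in events:
        if t_k < t:  # only strictly earlier events contribute
            decay = math.exp(-beta * (t - t_k))
            for i in range(len(mu)):
                lam[i] += alpha[i][j] * decay
    return lam

# Illustrative values only (not taken from the paper).
mu = [0.1, 0.05, 0.02]
alpha = [[0.5, 0.2, 0.3],
         [0.1, 0.4, 0.1],
         [0.05, 0.05, 0.3]]
events = [(0.0, 0), (1.0, 2)]  # a like at t=0, a share at t=1

right_after = hawkes_intensity(1.001, events, mu, alpha, beta=1.0)
much_later = hawkes_intensity(10.0, events, mu, alpha, beta=1.0)
```

In this sketch a share raises the intensity of likes and comments as well as further shares (mutual excitation), and the effect fades toward the baseline `mu` as time passes (decay).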

Paper Structure

This paper contains 86 sections, 8 theorems, 49 equations, 12 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let the $d$-dimensional counting process $N(t)$ be generated by a multivariate Hawkes process with its true conditional intensity vector governed by a sparse parameter vector $\theta^* \in \mathbb{R}^p$. Let $A = \{j \mid \theta_j^* \neq 0\}$ be the true active set of size $|A| = q \ll p$. The estimator where $L_N(\theta)$ is the normalized log-likelihood, and $p_{\gamma_N}(\cdot)$ is a non-concave pe

Figures (12)

  • Figure 1: Illustration of the core motivation for M$^3$TR. Given a target video (left) exhibiting a "flash-in-the-pan" popularity trajectory, a conventional Content-Only Retrieval system (middle) identifies videos that are thematically similar (e.g., other cat videos) but whose popularity patterns are fundamentally different, offering poor predictive value. In stark contrast, our M$^3$TR Retrieval (right), which is temporal-aware, retrieves videos with entirely different content (e.g., "cooking fail" and "dance trend") but nearly identical popularity trajectories. This demonstrates that the temporal user feedback of videos is a critical predictive signal that is missed by content-centric approaches.
  • Figure 1: Comparisons of Different Scenarios of Video Popularity Trends. (a) Steady growth, (b) Early surge with later decline, (c) Noise from fake likes.
  • Figure 2: Overall framework. The workflow begins with two input streams: the Input Video, from which multi-modal features (vision, audio, text) are extracted, and its corresponding temporal user feedback. M$^3$TR employs a Multi-Modal Extraction module for vision, audio and text. Meanwhile, it applies Temporal Dynamics Modeling with a Mamba-Hawkes Process (MHP) for temporal user feedback, which explicitly captures the long-range, self-exciting nature of user interactions, generating a sophisticated temporal feature representation ($X_i^s$). This rich temporal feature powers the other core innovation of our framework: the Temporal-Aware Retrieval engine. Unlike conventional methods that rely on static content similarity, our engine queries the Memory Bank to identify historical videos based on a combined similarity of both their popularity trajectories (derived from $X_i^s$) and their multi-modal content. It retrieves and averages the features of the most relevant historical examples to form an augmented feature vector ($X_i^R$). Finally, in the Interaction and Prediction stage, the model fuses the video's original multi-modal features ($X_i$) with the retrieved temporal-aware features ($X_i^R$) using Paired Cross Attention. The resulting representation is then refined through Attentive Pooling and Non-Linear Layers to generate the final Popularity Prediction.
  • Figure 2: Diversity of Micro-Video Popularity Trajectories
  • Figure 3: The Mamba-Hawkes Process (MHP) Architecture. It computes a dynamic intensity function by combining a classical Hawkes process with a non-linear context vector generated by a Mamba network that processes the entire event history.
  • ...and 7 more figures
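
The retrieval step described in the Figure 2 caption (score historical videos by a combined content and trajectory similarity, then average the features of the best matches) could be sketched roughly as follows. The captions do not say how the two similarities are combined, so the weighted sum with weight `w`, the use of cosine similarity, the dictionary layout, and all names and numbers below are assumptions for illustration only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query, memory_bank, k=2, w=0.5):
    """Return the top-k items and the average of their feature vectors.

    Each item is a dict with 'content', 'temporal' and 'features' vectors.
    w weights content similarity; (1 - w) weights trajectory similarity.
    The weighted sum is an assumed way to combine the two scores.
    """
    if not memory_bank:
        raise ValueError("memory bank is empty")
    scored = []
    for item in memory_bank:
        score = (w * cosine(query["content"], item["content"])
                 + (1 - w) * cosine(query["temporal"], item["temporal"]))
        scored.append((score, item))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = [item for _, item in scored[:k]]
    dim = len(top[0]["features"])
    avg = [sum(it["features"][d] for it in top) / len(top) for d in range(dim)]
    return top, avg

# Toy memory bank echoing Figure 1: same topic vs. same trajectory shape.
bank = [
    {"id": "cat", "content": [1.0, 0.0], "temporal": [0.1, 0.5, 1.0], "features": [1.0, 0.0]},
    {"id": "cooking_fail", "content": [0.0, 1.0], "temporal": [1.0, 0.2, 0.1], "features": [0.0, 2.0]},
    {"id": "dance_trend", "content": [0.1, 1.0], "temporal": [0.9, 0.3, 0.1], "features": [0.0, 4.0]},
]
query = {"content": [1.0, 0.0], "temporal": [1.0, 0.2, 0.1]}  # flash-in-the-pan cat video

temporal_top, temporal_avg = retrieve(query, bank, k=2, w=0.2)
content_top, _ = retrieve(query, bank, k=1, w=1.0)
```

With `w=1.0` the sketch reduces to content-only retrieval and returns the other cat video. With a small `w` it prefers the two videos whose trajectories match the query, regardless of topic.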

Theorems & Definitions (23)

  • Theorem 1: Oracle Properties of the MHP Estimator
  • Definition 1: Counting Process
  • Definition 2: Point Process
  • Definition 3: Relationship: Counting & Point Processes
  • Definition 4: Inter-Arrival Times
  • Definition 5: Filtration
  • Definition 6: History of a Counting Process
  • Definition 7: Conditional Intensity Process
  • Remark 1
  • Definition 8: Compensator
  • ...and 13 more