MIM: Multi-modal Content Interest Modeling Paradigm for User Behavior Modeling

Bencheng Yan, Si Chen, Shichang Jia, Jianyu Liu, Yueran Liu, Chenghan Fu, Wanxian Guan, Hui Zhao, Xiang Zhang, Kai Zhang, Wenbo Su, Pengjie Wang, Jian Xu, Bo Zheng, Baolin Liu

TL;DR

This work tackles CTR prediction by moving from ID-centric representations to content-based user interests through multi-modal content embeddings. It introduces MIM, a universal three-stage paradigm consisting of Pre-training, Content-Interest-Aware Supervised Fine-Tuning (C-SFT), and Content-Interest-Aware UBM (CiUBM), complemented by a Representation Center for efficient retrieval. Key innovations include downstream data adaptation and modal alignment in pre-training, a contrastive C-SFT with space-time negative sampling and multi-level InfoNCE, and a modular CiUBM that fuses ID and multi-modal signals. In industrial deployment on Taobao, MIM yields substantial offline and online gains in CTR and RPM, confirming its practical impact for large-scale, content-aware recommendation and search systems.

Abstract

Click-Through Rate (CTR) prediction is a crucial task in recommendation systems, online search, and advertising platforms, where accurately capturing users' real interests in content is essential for performance. However, existing methods rely heavily on ID embeddings, which fail to reflect users' true preferences for content such as images and titles. This limitation becomes particularly evident in cold-start and long-tail scenarios, where traditional approaches struggle to deliver effective results. To address these challenges, we propose a novel Multi-modal Content Interest Modeling paradigm (MIM), which consists of three key stages: Pre-training, Content-Interest-Aware Supervised Fine-Tuning (C-SFT), and Content-Interest-Aware UBM (CiUBM). The pre-training stage adapts foundational models to domain-specific data, enabling the extraction of high-quality multi-modal embeddings. The C-SFT stage bridges the semantic gap between content and user interests by leveraging user behavior signals to guide the alignment of embeddings with user preferences. Finally, the CiUBM stage integrates multi-modal embeddings and ID-based collaborative filtering signals into a unified framework. Comprehensive offline experiments and online A/B tests conducted on Taobao, one of the world's largest e-commerce platforms, demonstrate the effectiveness and efficiency of the MIM method. The method has been successfully deployed online, achieving a significant increase of +14.14% in CTR and +4.12% in RPM, showcasing its industrial applicability and substantial impact on platform performance. To promote further research, we have publicly released the code and dataset at https://pan.quark.cn/s/8fc8ec3e74f3.
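
The TL;DR names InfoNCE as the contrastive objective behind C-SFT. As a rough illustration of that general idea only, the sketch below computes a plain in-batch InfoNCE loss: each row of one embedding list is paired with the same row of the other, and all other rows in the batch serve as negatives. It is a minimal sketch, not the paper's method: the multi-level variant and the space-time negative sampling are not reproduced, and the function name `info_nce_in_batch`, the `temperature` value, and the toy embeddings are all assumptions made for this example.

```python
import math


def _normalize(vector):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce_in_batch(anchors, positives, temperature=0.1):
    """Generic in-batch InfoNCE loss (hypothetical helper, not the paper's code).

    Row i of `positives` is the positive for row i of `anchors`;
    every other row of `positives` acts as a negative for that anchor.
    """
    a = [_normalize(v) for v in anchors]
    p = [_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Temperature-scaled cosine similarity to every candidate in the batch.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability of picking the true positive.
        total += log_denominator - logits[i]
    return total / len(a)


# Toy batch: three (user-interest, clicked-item) embedding pairs (made-up values).
user_side = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
item_side = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.2, 0.8]]

loss_aligned = info_nce_in_batch(user_side, item_side)
loss_shuffled = info_nce_in_batch(user_side, item_side[::-1])
print(loss_aligned, loss_shuffled)
```

In this toy batch the correctly paired rows give a lower loss than the reversed pairing, which is the behavior a contrastive objective relies on: pulling matched content and interest embeddings together while pushing apart the other items in the batch.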

Paper Structure

This paper contains 27 sections, 9 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: An example of ID interest and content interest modeling in UBM.
  • Figure 2: The framework of MIM. There are a total of three stages, including Pre-training, C-SFT, and CiUBM. Besides, a representation center is built for efficiency consideration.
  • Figure 3: An example of negative sample generation. (a) Negative items are obtained from other items in the same batch. (b) Adding a hard negative item. (c) Taking the sample $s_1$ as an example, with the help of ST-NGS, the number of negative items (i.e., the red items) can be increased.
  • Figure 4: The framework of representation center.
  • Figure 5: Evaluation of the impact of different FoMs including $F_V$ and $F_L$. From left to right, a more powerful FoM is adopted.
  • ...and 1 more figure