MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

Bin Wu, Feifan Yang, Zhangming Chan, Yu-Ran Gu, Jiawei Feng, Chao Yi, Xiang-Rong Sheng, Han Zhu, Jian Xu, Mang Ye, Bo Zheng

TL;DR

MUSE shows that the gains from lifelong multimodal CTR modeling come from two design choices: a simple General Search Unit (GSU) that relies on high-quality multimodal embeddings, and a richer Exact Search Unit (ESU) that explicitly models multimodal sequences and fuses semantic signals with ID information. The framework achieves state-of-the-art offline and online performance, scales to ultra-long behavior histories, and incurs negligible latency in production. By open-sourcing large-scale multimodal lifelong behavior data and sharing deployment practices, the work promotes reproducibility and further research in multimodal lifelong modeling. The key contributions are a principled analysis of GSU versus ESU design, the SimTier and SA-TA components, and practical deployment insights from a real-world Taobao system.

Abstract

Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features and therefore suffer from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage, the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data are available at https://taobao-mm.github.io.
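
The GSU idea described in the abstract (score each historical behavior by cosine similarity to the target item, then keep the top-K) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `gsu_top_k`, the toy 2-dimensional embeddings, and the brute-force scan are all assumptions made for illustration, and a production system would use batched, vectorized scoring over pre-computed embeddings.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_top_k(
    target: Sequence[float],
    history: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (index, similarity) for the k behaviors most similar to target.

    Hypothetical helper: scores every behavior in the history, which is
    only practical for small toy inputs like the one below.
    """
    scored = [(i, cosine(target, emb)) for i, emb in enumerate(history)]
    # Sort by similarity, highest first, and keep the first k entries.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy multimodal embeddings for four past behaviors (made-up values).
history = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
target = [1.0, 0.1]

top = gsu_top_k(target, history, k=2)
print([index for index, _ in top])
```

In this toy input, behaviors 0 and 2 point in roughly the same direction as the target, so they are the ones retained for the downstream ESU stage.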

Paper Structure

This paper contains 38 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of MUSE. (a) Multimodal item embeddings are pre-trained via Semantic-aware Contrastive Learning (SCL). In the recommendation phase, (b) the GSU stage efficiently retrieves the top-$K$ behaviors most relevant to the target item from the user’s lifelong history using lightweight multimodal cosine similarity, drastically reducing the sequence length for downstream processing. (c) The ESU stage models fine-grained user interests through two components: the SimTier module compresses multimodal similarity sequences into histograms, while the Semantic-Aware Target Attention (SA-TA) module enriches ID-based attention with semantic guidance to produce the final lifelong user interest representation.
  • Figure 2: Performance of different multimodal representations. ESU clearly favors fine-grained representations.
  • Figure 3: Semantic-Aware Target Attention (SA-TA) augments ID-based attention by incorporating multimodal semantic similarity.
  • Figure 4: Online deployment of MUSE in the Taobao display advertising system. The GSU pre-fetches the user behavior sequence and multimodal embeddings asynchronously alongside the matching stage, and the cached outputs are consumed by GSU top-$K$ selection and ESU modeling during ranking.
  • Figure 5: Performance when the GSU takes behavior sequences of different lengths. Left: GAUC values. Right: Relative GAUC improvement of MUSE using MM-Enhanced ESU compared to ID-Only ESU at different sequence lengths.
  • ...and 2 more figures
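
The two ESU components named in the Figure 1 and Figure 3 captions can be sketched roughly as follows. The captions state only that SimTier compresses a similarity sequence into a histogram and that SA-TA adds semantic guidance to ID-based attention. Everything else here is an assumption for illustration: the equal-width tiers over [-1, 1], the additive fusion `id_logit + alpha * similarity`, the parameter `alpha`, and the function names `simtier_histogram` and `sa_ta_weights` are hypothetical and may differ from the paper's formulation.

```python
import math
from typing import List, Sequence


def simtier_histogram(similarities: Sequence[float], num_tiers: int) -> List[int]:
    """Count how many similarities fall into each equal-width tier of [-1, 1].

    Assumption: tiers are equal-width; the source only says "histograms".
    """
    counts = [0] * num_tiers
    for s in similarities:
        # Map s from [-1, 1] to a tier index in [0, num_tiers - 1].
        idx = int((s + 1.0) / 2.0 * num_tiers)
        idx = min(max(idx, 0), num_tiers - 1)  # clamp s == 1.0 and outliers
        counts[idx] += 1
    return counts


def sa_ta_weights(
    id_logits: Sequence[float],
    similarities: Sequence[float],
    alpha: float,
) -> List[float]:
    """Softmax attention weights over ID logits shifted by semantic similarity.

    Assumption: additive fusion with a scalar alpha; not confirmed by the source.
    """
    fused = [l + alpha * s for l, s in zip(id_logits, similarities)]
    peak = max(fused)  # subtract the max for numerical stability
    exps = [math.exp(f - peak) for f in fused]
    total = sum(exps)
    return [e / total for e in exps]


# Toy similarities between the target item and five retrieved behaviors.
sims = [0.95, 0.9, 0.1, -0.3, 0.45]
hist = simtier_histogram(sims, num_tiers=4)
print(hist)

# With equal ID logits, the semantic term alone decides the ranking.
weights = sa_ta_weights([0.0, 0.0, 0.0], [1.0, 0.0, -1.0], alpha=1.0)
print(weights)
```

The histogram has a fixed length regardless of how many behaviors were retrieved, which is one plausible reason to prefer it as a compact input feature. In the attention sketch, setting `alpha` to 0 recovers plain ID-based attention.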