MUSE: A Simple Yet Effective Multimodal Search-Based Framework for Lifelong User Interest Modeling

Bin Wu; Feifan Yang; Zhangming Chan; Yu-Ran Gu; Jiawei Feng; Chao Yi; Xiang-Rong Sheng; Han Zhu; Jian Xu; Mang Ye; Bo Zheng

TL;DR

MUSE shows that the gains from multimodal signals in lifelong CTR modeling come from two design choices: a simple General Search Unit (GSU) built on high-quality multimodal embeddings, and a richer Exact Search Unit (ESU) that explicitly models multimodal sequences and fuses semantic signals with ID information. The framework achieves state-of-the-art offline and online performance, scales to ultra-long behavior histories, and adds negligible latency in production. By open-sourcing large-scale multimodal lifelong behavior data and sharing deployment practices, the work promotes reproducibility and further research in multimodal lifelong modeling. The key contributions are a principled analysis of GSU versus ESU design, the SimTier and SA-TA components, and practical deployment insights from a real-world Taobao system.

Abstract

Lifelong user interest modeling is crucial for industrial recommender systems, yet existing approaches rely predominantly on ID-based features, suffering from poor generalization on long-tail items and limited semantic expressiveness. While recent work explores multimodal representations for behavior retrieval in the General Search Unit (GSU), it often neglects multimodal integration in the fine-grained modeling stage, the Exact Search Unit (ESU). In this work, we present a systematic analysis of how to effectively leverage multimodal signals across both stages of the two-stage lifelong modeling framework. Our key insight is that simplicity suffices in the GSU: lightweight cosine similarity with high-quality multimodal embeddings outperforms complex retrieval mechanisms. In contrast, the ESU demands richer multimodal sequence modeling and effective ID-multimodal fusion to unlock its full potential. Guided by these principles, we propose MUSE, a simple yet effective multimodal search-based framework. MUSE has been deployed in the Taobao display advertising system, enabling 100K-length user behavior sequence modeling and delivering significant gains in top-line metrics with negligible online latency overhead. To foster community research, we share industrial deployment practices and open-source the first large-scale dataset featuring ultra-long behavior sequences paired with high-quality multimodal embeddings. Our code and data are available at https://taobao-mm.github.io.
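To make the GSU idea concrete, the following is a minimal sketch of cosine-similarity retrieval over a behavior sequence. It is an illustration only, not the authors' implementation: the function names `cosine_similarity` and `gsu_top_k`, the parameter `k`, and the toy two-dimensional embeddings are all assumptions introduced here, and a production system would use vectorized or approximate search rather than a Python loop.

```python
import heapq
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def gsu_top_k(
    target: Sequence[float],
    behaviors: Sequence[Sequence[float]],
    k: int,
) -> List[int]:
    """Hypothetical GSU step: indices of the k behaviors most similar to the target.

    `target` is the multimodal embedding of the candidate item and
    `behaviors` holds one embedding per historical behavior. Indices are
    returned in order of decreasing similarity.
    """
    scores = [cosine_similarity(target, b) for b in behaviors]
    return heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)


# Toy example (assumed data): four past behaviors, keep the two closest.
target_embedding = [1.0, 0.0]
behavior_embeddings = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0], [2.0, 0.0]]
selected = gsu_top_k(target_embedding, behavior_embeddings, k=2)
```

In this sketch the selected indices would then be handed to the ESU, which is where the abstract places the richer sequence modeling and ID-multimodal fusion. Because cosine similarity ignores vector length, the behavior `[2.0, 0.0]` ranks as a perfect match for the target despite its larger norm.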
