MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising

Chenghan Fu, Daoze Zhang, Yukang Lin, Zhanheng Nie, Xiang Zhang, Jianyu Liu, Yueran Liu, Wanxian Guan, Pengjie Wang, Jian Xu, Bo Zheng

TL;DR

MOON addresses the challenge of integrating multimodal product representations into large-scale e-commerce CTR systems. It proposes a decoupled, three-stage training paradigm (Pretraining, Post-training, Application) centered on generative MLLM-based multimodal learning and a downstream content-based user behavior module (CUBE), guided by an intermediate metric called image-based search recall to align with CTR gains. The approach leverages TBStars-VL, multi-granularity representations via Matryoshka Representation Learning, and advanced data processing (deduplication, NER filtering, hard and spatial-temporal negative sampling) to achieve substantial online CTR improvements (+20%), with robust retrieval performance and cross-modal alignment. A comprehensive infrastructure supports production, consumption, and real-time perception at industrial scale, including ALake storage, a computing-in-memory representation center, dynamic loading, and low-latency embedding serving. The work demonstrates five iterative versions, explores scaling laws for tokens, negatives, and behavior sequences, and highlights tangible benefits in new-product discovery, fashion categories, and bottom-tier merchants, indicating strong practical impact for e-commerce search and advertising systems.

Abstract

We introduce MOON, our comprehensive set of sustainable, iterative practices for multimodal representation learning in e-commerce applications. MOON has been fully deployed across all stages of the Taobao search advertising system, including retrieval, relevance, and ranking. The gains are particularly significant on the click-through rate (CTR) prediction task, where MOON achieves an overall +20.00% online CTR improvement. Over the past three years, this project has delivered the largest improvement on the CTR prediction task and has undergone five full-scale iterations. Throughout the exploration and iteration of MOON, we have accumulated insights and practical experience that we believe will benefit the research community. MOON follows a three-stage training paradigm of "Pretraining, Post-training, and Application", which allows multimodal representations to be integrated effectively with downstream tasks. Notably, to bridge the misalignment between the objectives of multimodal representation learning and downstream training, we define the exchange rate, which quantifies how effectively improvements in an intermediate metric translate into downstream gains. Through this analysis, we identify image-based search recall as a critical intermediate metric for guiding the optimization of multimodal models. Over three years and five iterations, MOON has evolved along four critical dimensions: data processing, training strategy, model architecture, and downstream application. We also share the lessons and insights gained through these iterative improvements. As part of our exploration of scaling effects in e-commerce, we further conduct a systematic study of the scaling laws governing multimodal representation learning, examining factors such as the number of training tokens, the number of negative samples, and the length of user behavior sequences.
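The abstract introduces the exchange rate without giving its formula in this section. As a rough illustration only, the sketch below assumes it can be read as "downstream gain per unit of intermediate-metric gain" and estimates it as a least-squares slope through the origin over several model iterations. The function name `exchange_rate` and the numbers in the usage note are hypothetical and are not taken from the paper.

```python
def exchange_rate(intermediate_gains, downstream_gains):
    """Estimate downstream gain per unit of intermediate-metric gain.

    ASSUMPTION: the exchange rate is modelled here as the slope of a
    least-squares line through the origin; the paper's exact definition
    is not stated in this section.
    """
    if len(intermediate_gains) != len(downstream_gains):
        raise ValueError("both sequences must have the same length")
    # Slope of y = k * x fitted by least squares: k = sum(x*y) / sum(x*x)
    num = sum(x * y for x, y in zip(intermediate_gains, downstream_gains))
    den = sum(x * x for x in intermediate_gains)
    if den == 0:
        raise ValueError("intermediate gains are all zero; slope is undefined")
    return num / den
```

For example, with made-up gains of `[1.0, 2.0]` points on an intermediate metric and `[0.5, 1.0]` points on the downstream metric, the function returns `0.5`. Under this reading, a candidate intermediate metric with a higher and more stable exchange rate would be the more useful optimization target.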

Paper Structure

This paper contains 41 sections, 4 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Overview of the architecture and infrastructure of the latest MOON iteration.
  • Figure 2: Comparison between the dual-flow and MLLM architectures. (a) The dual-flow paradigm is inherently limited to encoding one-to-one image-text pairs and cannot directly capture many-to-one relationships. (b) The MLLM-based approach is naturally suited to modeling the richer visual content of multiple SKU images.
  • Figure 3: The architecture and training of our MOON model. (a) Beyond the dual-encoder paradigm, we employ a generative-MLLM-based method for product representation learning. (b) In post-training, we leverage real-world user behaviors as supervision to effectively capture latent correlations between related product items. (c) Moreover, during the second post-training stage, we not only construct hard negative samples but also use Spatial-Temporal Negative Sampling to learn more robust and discriminative representations.
  • Figure 4: Illustration of Spatial-Temporal Negative Sampling. (a) In trivial in-batch sampling, negative items are sampled from the other items in the same batch. (b) We prepare a similar item belonging to the same category as the query as a hard negative item. (c) Taking the sample $s_1$ as an example, we greatly expand the negative pool along both the spatial and temporal dimensions.
  • Figure 5: The architecture of the content-based user behavior extractor (CUBE) for the downstream CTR task. The behavior sequence includes the item behavior sequence and the query behavior sequence.
  • ...and 7 more figures
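
The three negative sources described in the Figure 4 caption (in-batch items, one hard negative per query, and an expanded pool) can be sketched as a single contrastive loss. The code below is a minimal pure-Python illustration using a standard InfoNCE objective, not the authors' implementation. The function `info_nce`, its arguments, and the default temperature are assumptions, and `extra_negatives` merely stands in for the expanded pool, whose construction is not described in the caption.

```python
import math

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [a / norm for a in v]

def info_nce(queries, positives, hard_negatives, extra_negatives=(), temperature=0.07):
    """Mean InfoNCE loss over a batch of embedding vectors (lists of floats).

    For query i the candidates are:
      - positives[i]            the target
      - positives[j], j != i    in-batch negatives
      - hard_negatives[i]       one hard negative prepared for this query
      - extra_negatives         a shared pool (ASSUMPTION: placeholder for
                                the expanded negative pool in the caption)
    """
    q = [_normalize(x) for x in queries]
    p = [_normalize(x) for x in positives]
    h = [_normalize(x) for x in hard_negatives]
    e = [_normalize(x) for x in extra_negatives]

    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarities scaled by temperature; index i is the positive.
        logits = [_dot(qi, pj) / temperature for pj in p]
        logits.append(_dot(qi, h[i]) / temperature)
        logits.extend(_dot(qi, ek) / temperature for ek in e)
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)
```

Because every additional negative adds a positive term to the softmax denominator, enlarging the pool can only raise the loss for a fixed model, which is what makes a larger pool a harder training signal.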