Enhancing Taobao Display Advertising with Multimodal Representations: Challenges, Approaches and Insights

Xiang-Rong Sheng, Feifan Yang, Litong Gong, Biao Wang, Zhangming Chan, Yujing Zhang, Yueyao Cheng, Yong-Nan Zhu, Tiezheng Ge, Han Zhu, Yuning Jiang, Jian Xu, Bo Zheng

TL;DR

This paper tackles the challenge of leveraging multimodal data in Taobao's large-scale display advertising by proposing a practical two-phase framework: pre-train multimodal representations with semantic-aware contrastive learning (SCL) to capture semantic similarity, and then integrate these representations with existing ID-based models using two methods, SimTier and MAKE, to address training-epoch disparities. An industrial deployment design enables real-time generation and indexing of multimodal representations for new items, achieving latency of only a few seconds and enabling near-line training and online prediction. Empirical results show that SCL significantly outperforms semantic-agnostic baselines, and that combining SimTier with MAKE yields the largest improvements in CTR prediction and downstream GAUC/AUC, especially for long-tail items and new ads. Since mid-2023, these multimodal representations have delivered meaningful online gains (e.g., CTR, RPM, ROI), validating the practical utility of multimodal signals in industrial recommender systems and offering a concrete blueprint for practitioners seeking to adopt similar approaches.

Abstract

Despite the recognized potential of multimodal data to improve model accuracy, many large-scale industrial recommendation systems, including the Taobao display advertising system, depend predominantly on sparse ID features in their models. In this work, we explore approaches to leveraging multimodal data to enhance recommendation accuracy. We start by identifying the key challenges in adopting multimodal data in a manner that is both effective and cost-efficient for industrial systems. To address these challenges, we introduce a two-phase framework: 1) pre-training multimodal representations to capture semantic similarity, and 2) integrating these representations with existing ID-based models. Furthermore, we detail the architecture of our production system, which is designed to facilitate the deployment of multimodal representations. Since integrating multimodal representations in mid-2023, we have observed significant performance improvements in the Taobao display advertising system. We believe that the insights we have gathered will serve as a valuable resource for practitioners seeking to leverage multimodal data in their systems.
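To make the first phase more concrete, the following is a minimal sketch of a generic contrastive (InfoNCE-style) loss over a batch of embedding pairs. It is an illustration only and not necessarily the authors' objective: the function name `info_nce_loss`, the `temperature` value, and the use of in-batch negatives are all assumptions made here. In a semantic-aware setup, what matters is how the positive pairs are chosen so that paired items are semantically similar; this sketch simply takes the pairs as given.

```python
import numpy as np


def info_nce_loss(queries, positives, temperature=0.1):
    """Generic InfoNCE loss (hypothetical helper, not the paper's exact loss).

    queries, positives: arrays of shape (batch, dim); row i of `positives`
    is the positive for row i of `queries`. Every other row in the batch
    is treated as a negative (an assumption of this sketch).
    """
    # L2-normalize so the dot product is cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    # Pairwise similarity matrix, scaled by the temperature.
    logits = (q @ p.T) / temperature

    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy with the diagonal (the true pair) as the target class.
    return float(-np.mean(np.diag(log_probs)))
```

Minimizing this loss pulls each paired embedding together and pushes it away from the other items in the batch, which is the sense in which the pre-trained representations come to encode similarity.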

Paper Structure

This paper contains 27 sections, 5 equations, 6 figures, 3 tables, and 1 algorithm.

Figures (6)

  • Figure 1: An overview of our two-phase framework: the pre-training of multimodal representations, followed by the integration of the pre-trained representations into recommendation models. In the first phase (Figure 1(a)), we pre-train through semantic-aware contrastive learning, which equips the multimodal representations with the ability to identify semantically similar items. In the second phase (Figure 1(b)), we introduce our proposed SimTier and MAKE methods to incorporate the pre-trained multimodal representations effectively into the recommendation models.
  • Figure 2: An illustration of the proposed SimTier and Multimodal Knowledge Extractor (MAKE) approaches, with details provided in the SimTier and MAKE sections, respectively.
  • Figure 3: The multimodal-based CTR prediction model (MM) demonstrates a continuous increase in GAUC after several training epochs. In contrast, the ID-based model (ID) shows a sharp decline in GAUC during testing after the second epoch of training.
  • Figure 4: An overview of the online system.
  • Figure 5: A case of the pre-training dataset.
  • ...and 1 more figures
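
The GAUC metric referenced in the Figure 3 caption can be sketched as follows. This assumes the commonly used definition of group AUC, in which AUC is computed separately for each user and then averaged with weights equal to each user's number of impressions; users whose samples are all clicks or all non-clicks are skipped because their AUC is undefined. The function names `auc` and `gauc` are hypothetical, and the paper's exact weighting may differ.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC: the fraction of (positive, negative) pairs ranked
    correctly, counting ties as half. Returns None if one class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def gauc(user_ids, labels, scores):
    """Impression-weighted average of per-user AUC (assumed definition)."""
    groups = defaultdict(lambda: ([], []))
    for user, label, score in zip(user_ids, labels, scores):
        groups[user][0].append(label)
        groups[user][1].append(score)

    weighted_sum, total_weight = 0.0, 0
    for user_labels, user_scores in groups.values():
        user_auc = auc(user_labels, user_scores)
        if user_auc is None:
            continue  # AUC is undefined for single-class users
        weighted_sum += len(user_labels) * user_auc
        total_weight += len(user_labels)
    return weighted_sum / total_weight if total_weight else None
```

Because the ranking is evaluated within each user's own impressions, GAUC is insensitive to score offsets between users, which is one reason it is often preferred over global AUC for CTR models.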