Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks in E-Commerce Search

Enqiang Xu, Xinhui Li, Zhigong Zhou, Jiahao Ji, Jinyuan Zhao, Dadong Miao, Songlin Wang, Lin Liu, Sulong Xu

TL;DR

The paper addresses the underuse of multimodal information in e-commerce search re-ranking by proposing ARMMT, which combines an attention-based Context-Aware Fusion Unit, a Multi-Perspective Self-Attention mechanism, and a multimodal auxiliary task to fuse text and image cues into item representations and personalized representations. It introduces separate item and personalized representations, a hierarchical fusion strategy, and auxiliary supervision to align multimodal signals with the ranking objective. Offline results show an AUC of $0.9647$, a $0.0005$ gain over a strong baseline, while online A/B testing reports a CVR improvement of $0.22\%$ and a GMV improvement of $0.49\%$, supporting commercial viability on JD.com. The results suggest that integrating multimodal signals through context-aware fusion and auxiliary tasks enhances personalization and conversion in e-commerce search, and the paper points to further opportunities in incorporating additional modalities and dynamic ranking objectives.

Abstract

In the rapidly evolving field of e-commerce, the effectiveness of search re-ranking models is crucial for enhancing user experience and driving conversion rates. Despite significant advancements in feature representation and model architecture, the integration of multimodal information remains underexplored. This study addresses this gap by investigating the computation and fusion of textual and visual information in the context of re-ranking. We propose Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks (ARMMT), which integrates an attention-based multimodal fusion technique and an auxiliary ranking-aligned task to enhance item representation and improve targeting capabilities. This method not only enriches the understanding of product attributes but also enables more precise and personalized recommendations. Experimental evaluations on JD.com's search platform demonstrate that ARMMT achieves state-of-the-art performance in multimodal information integration, evidenced by a 0.22% increase in the Conversion Rate (CVR), significantly contributing to Gross Merchandise Volume (GMV). This pioneering approach has the potential to revolutionize e-commerce re-ranking, leading to elevated user satisfaction and business growth.

Paper Structure (27 sections, 17 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: The framework of Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks (ARMMT).
  • Figure 2: The encoding process of textual and image information. Effective information from user behavior sequences is extracted through multi-head attention.
  • Figure 3: The diagram of the Context-Aware Fusion Unit. In this diagram, triangles, rectangles, and circles represent context, text, and image features, respectively.