Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks in E-Commerce Search

Enqiang Xu; Xinhui Li; Zhigong Zhou; Jiahao Ji; Jinyuan Zhao; Dadong Miao; Songlin Wang; Lin Liu; Sulong Xu

TL;DR

The paper addresses the underexplored use of multimodal information in e-commerce search re-ranking by proposing ARMMT, which combines an attention-based Context-Aware Fusion Unit, a Multi-Perspective Self-Attention mechanism, and a multimodal auxiliary task. Text and image cues are fused into two separate representations: one for the item itself and one for the personalized item. A hierarchical fusion strategy combines these signals, and the auxiliary supervision aligns them with the ranking objective. Offline, ARMMT reaches an AUC of $0.9647$, a gain of $0.0005$ over a strong baseline. In online A/B testing on JD.com, it improves CVR by $0.22\%$ and GMV by $0.49\%$. The results indicate that multimodal signals, combined with context-aware fusion and auxiliary tasks, improve personalization and conversion in e-commerce search, and they point to further opportunities in additional modalities and dynamic ranking objectives.

Abstract

In the rapidly evolving field of e-commerce, the effectiveness of search re-ranking models is crucial for enhancing user experience and driving conversion rates. Despite significant advancements in feature representation and model architecture, the integration of multimodal information remains underexplored. This study addresses this gap by investigating the computation and fusion of textual and visual information in the context of re-ranking. We propose \textbf{A}dvancing \textbf{R}e-Ranking with \textbf{M}ulti\textbf{m}odal Fusion and \textbf{T}arget-Oriented Auxiliary Tasks (ARMMT), which integrates an attention-based multimodal fusion technique and an auxiliary ranking-aligned task to enhance item representation and improve targeting capabilities. This method not only enriches the understanding of product attributes but also enables more precise and personalized recommendations. Experimental evaluations on JD.com's search platform demonstrate that ARMMT achieves state-of-the-art performance in multimodal information integration, evidenced by a 0.22\% increase in the Conversion Rate (CVR), significantly contributing to Gross Merchandise Volume (GMV). This pioneering approach has the potential to revolutionize e-commerce re-ranking, leading to elevated user satisfaction and business growth.
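The abstract names an attention-based multimodal fusion technique without giving its equations on this page. The following is only a minimal sketch of the general idea, assuming a single item, equal-length context, text, and image vectors, and scaled dot-product scoring with no learned projections. The function name `context_aware_fuse` and all values are hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def context_aware_fuse(context, modalities):
    """Weight each modality vector by its similarity to the context vector.

    context: list of floats of length d (hypothetical context embedding)
    modalities: dict mapping a modality name to a list of floats of length d
    Returns the fused vector and the attention weight per modality.
    """
    scale = math.sqrt(len(context))
    names = list(modalities)
    # Scaled dot-product score between the context and each modality.
    scores = [dot(context, modalities[name]) / scale for name in names]
    weights = softmax(scores)
    # The fused vector is the attention-weighted sum of the modality vectors.
    fused = [
        sum(w * modalities[name][i] for w, name in zip(weights, names))
        for i in range(len(context))
    ]
    return fused, dict(zip(names, weights))


# Toy input: the context points in the same direction as the text vector.
context = [1.0, 0.0]
modalities = {"text": [2.0, 0.0], "image": [0.0, 2.0]}
fused, weights = context_aware_fuse(context, modalities)
```

In this toy input the context is aligned with the text vector, so the text modality receives the larger weight. A production model would presumably apply learned projections and operate over the whole candidate list, which this sketch omits.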

Paper Structure (27 sections, 17 equations, 3 figures, 3 tables)

Introduction
Related Work
Re-ranking Models
Multimodal Fusion
Preliminaries
Background
Base Model
ID-based Deep Interest Network
Context-based Transformer Encoder
Method
Multimodal Representations
Multimodal Representation of Item
Multimodal Representation of Personalized Item
Hierarchical Multimodal Fusion
Context-Aware Fusion Unit
...and 12 more sections

Figures (3)

Figure 1: The framework of Advancing Re-Ranking with Multimodal Fusion and Target-Oriented Auxiliary Tasks (ARMMT).
Figure 2: The encoding process of textual and image information. Effective information from user behavior sequences is extracted through multi-head attention.
Figure 3: The diagram of the Context-Aware Fusion Unit. Triangles, rectangles, and circles represent context, text, and image features, respectively.
