
MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce

Hao Jiang, Haoxiang Zhang, Qingshan Hou, Chaofeng Chen, Weisi Lin, Jingchang Zhang, Annan Wang

TL;DR

MRSE tackles the challenge of reliable large-scale item recall by integrating text, images, and user preferences through a two-stage, multi-modality embedding framework. It introduces three lightweight experts (VBert, FtAtt, Light-Bert) and a DSSM-based fusion, guided by a hybrid loss that blends in-batch softmax and hard-negative triplet objectives to align cross-modal features. The approach yields an 18.9% improvement in offline relevance and a 3.7% GMV uplift online on Shopee, and is deployed as a base model in production, with a deployment strategy that includes Q2I and I2I components for efficiency and diversity. This work demonstrates the practical viability and business value of modality-aware retrieval in large-scale e-commerce systems and points to future improvements through temporal multi-modal signals.

Abstract

Providing high-quality item recall for text queries is crucial in large-scale e-commerce search systems. Current Embedding-based Retrieval Systems (ERS) embed queries and items into a shared low-dimensional space, but uni-modality ERS rely too heavily on textual features, making them unreliable in complex contexts. While multi-modality ERS incorporate various data sources, they often overlook individual preferences for different modalities, leading to suboptimal results. To address these issues, we propose MRSE, a Multi-modality Retrieval System that integrates text, item images, and user preferences through lightweight mixture-of-expert (LMoE) modules to better align features across and within modalities. MRSE also builds user profiles at a multi-modality level and introduces a novel hybrid loss function that enhances consistency and robustness using hard negative sampling. Experiments on a large-scale dataset from Shopee and online A/B testing show that MRSE achieves an 18.9% improvement in offline relevance and a 3.7% gain in online core metrics compared to Shopee's state-of-the-art uni-modality system.

Paper Structure

This paper contains 32 sections, 15 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Demo of a textual query with color intention: bad cases (left red rectangles) with uni-EBS, and good cases (right green rectangles) with MRSE. The first bad case shows a failure to capture color preference, while the second involves an incorrect match with red guitar strings.
  • Figure 2: Architecture of MRSE: The multi-modality extraction comprises three LMoE modules, color-coded as shown in the top left corner. In the multi-modality fusion stage, $\varphi(\mathfrak{T})$ and $\psi(\mathfrak{Q})$ represent the final representations for the item tower and query tower, respectively. Key differences between the query and item towers are highlighted with a gray background.
  • Figure 3: Illustration of the proposed Hybrid Loss. The blue, green, and red cubes represent the contrast matrix, easy negatives, and hard negatives, respectively. $N$ denotes the batch size; $e$ and $h$ are the numbers of easy negatives and hard negatives.
  • Figure 4: Overview of MRSE System Architecture
  • Figure 5: (a) Convergence curves of Recall@500 for VBert using triplet loss, softmax loss, and our proposed hybrid loss. (b) Retrieval performance when training MRSE with different hard-negative ratios ($\beta / \alpha$). We construct two additional evaluation datasets, each pairing a query with its item pool, consisting of easy negatives (randomly picked) and hard negatives (from the relevance stage), respectively.
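
The hybrid loss referred to in the Figure 3 and Figure 5 captions blends an in-batch softmax objective with a triplet objective over hard negatives. The exact formulation is not given in this summary, so the following is only a minimal NumPy sketch under stated assumptions: the other items in the batch act as easy negatives, `alpha` and `beta` are treated as plain weights on the two terms (the captions only mention a ratio $\beta / \alpha$), and `temperature`, `margin`, and all function names are hypothetical choices made for illustration.

```python
import numpy as np


def in_batch_softmax_loss(q, pos, temperature=0.05):
    """Softmax cross-entropy over the (N, N) query-item score matrix.

    q, pos: arrays of shape (N, dim); row i of `pos` is the positive item
    for row i of `q`. Every other row is used as an in-batch (easy) negative.
    """
    logits = q @ pos.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))


def hard_negative_triplet_loss(q, pos, hard_neg, margin=0.2):
    """Hinge-style triplet loss over h hard negatives per query.

    hard_neg: array of shape (N, h, dim).
    """
    pos_score = np.sum(q * pos, axis=1, keepdims=True)   # (N, 1)
    neg_score = np.einsum("nd,nhd->nh", q, hard_neg)     # (N, h)
    return float(np.mean(np.maximum(0.0, margin - pos_score + neg_score)))


def hybrid_loss(q, pos, hard_neg, alpha=1.0, beta=0.5,
                temperature=0.05, margin=0.2):
    """Weighted sum of the two terms (the weighting scheme is an assumption)."""
    soft = in_batch_softmax_loss(q, pos, temperature)
    trip = hard_negative_triplet_loss(q, pos, hard_neg, margin)
    return alpha * soft + beta * trip
```

In this sketch, raising `beta` relative to `alpha` puts more emphasis on separating each query from its hard negatives, which is one plausible reading of the ratio varied in Figure 5(b).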