Multimodal semantic retrieval for product search
Dong Liu, Esther Lopez Ramos
TL;DR
The paper addresses the limitations of text-only semantic retrieval in e-commerce by incorporating visual information into multimodal product representations. It introduces two architectures, 4tMM and 3tMM, that fuse BiBERT text encoders with CLIP visual encoders, and studies several fusion strategies in which query–product pairs are scored by cosine similarity and aligned with the NT-Xent loss. Experiments with large-scale training data show that multimodal models can improve either purchase recall or relevance accuracy, with 4tMM achieving stronger language–visual alignment and notable exclusive-match gains. The findings offer practical guidance for scalable multimodal product retrieval in e-commerce, and the exclusive-match analysis indicates substantial potential for novel, high-quality matches that text-only baselines do not retrieve.
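To make the training objective concrete, the following is a minimal sketch of an in-batch NT-Xent loss over cosine similarities. It is an illustration under stated assumptions, not the authors' implementation: the function name `nt_xent_loss`, the temperature of 0.1, and the one-directional (query-to-product) form are all assumptions, and the summary above does not specify how negatives are sampled.

```python
import numpy as np

def nt_xent_loss(query_emb, product_emb, temperature=0.1):
    """One-directional in-batch NT-Xent (assumed variant, hypothetical name).

    Row i of query_emb is treated as the positive pair of row i of
    product_emb; every other product in the batch acts as a negative.
    """
    # L2-normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = product_emb / np.linalg.norm(product_emb, axis=1, keepdims=True)

    # Temperature-scaled cosine similarity between every query and product.
    logits = (q @ p.T) / temperature

    # Numerically stable log-softmax over the products in the batch.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Cross-entropy against the diagonal (the matching product of each query).
    return float(-np.mean(np.diag(log_probs)))
```

The loss is small when each query is most similar to its own product and grows when a different product in the batch scores higher, which is the alignment behaviour the objective is meant to encourage.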
Abstract
Semantic retrieval (also known as dense retrieval) over textual data has been studied extensively for both web search and product search, where the relevance of a query to a candidate document is computed by comparing their dense vector representations. Product images are crucial to e-commerce search interactions and are a key factor for customers exploring products. However, their impact on semantic retrieval has not yet been well studied. In this research, we build a multimodal representation for product items in e-commerce search, in contrast to a pure-text representation of products, and investigate the impact of such representations. The models are developed and evaluated on e-commerce datasets. We demonstrate that a multimodal representation of a product can improve either purchase recall or relevance accuracy in semantic retrieval. Additionally, we provide a numerical analysis of the exclusive matches retrieved by a multimodal semantic retrieval model versus a text-only semantic retrieval model, to validate multimodal solutions.
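The exclusive-match comparison can be illustrated with a short sketch. It assumes that an exclusive match is a product that appears in one model's top-k results for a query and not in the other's; the function name `exclusive_matches` and the dictionary layout are hypothetical and chosen only for illustration.

```python
def exclusive_matches(multimodal_topk, text_only_topk):
    """Split each query's results into model-exclusive and shared products.

    Both arguments map a query string to a list of retrieved product IDs
    (assumed layout). Queries are taken from the multimodal results.
    """
    report = {}
    for query, mm_ids in multimodal_topk.items():
        mm = set(mm_ids)
        txt = set(text_only_topk.get(query, []))
        report[query] = {
            "multimodal_only": mm - txt,  # retrieved only by the multimodal model
            "text_only": txt - mm,        # retrieved only by the text-only model
            "shared": mm & txt,           # retrieved by both models
        }
    return report
```

Counting the products in `multimodal_only` across queries, and checking how many of them are relevant or purchased, gives one simple way to quantify what the multimodal model adds beyond the text-only baseline.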
