ITEm: Unsupervised Image-Text Embedding Learning for eCommerce

Baohao Liao, Michael Kozielski, Sanjika Hewavitharana, Jiangbo Yuan, Shahram Khadivi, Tomer Lancewicki

TL;DR

The paper tackles cross-modal product embedding for eCommerce by proposing ITEm, an unsupervised, single-stream transformer that learns balanced image-text representations without region-of-interest (ROI) guidance. It introduces five pre-training objectives (Image-Text Matching, Masked Language Modeling, Masked Image Modeling, and global-information variants of the two masked-modeling objectives) to fuse image and title information effectively. On the ITOP dataset, ITEm demonstrates state-of-the-art performance on two fine-grained tasks, same-product retrieval and leaf-category prediction, outperforming uni-modal and some multi-modal baselines. The work advances practical multi-modal embeddings for eCommerce retrieval and classification, with potential for a public data release and for application beyond eCommerce.

Abstract

Product embedding serves as a cornerstone for a wide range of applications in eCommerce. The product embedding learned from multiple modalities shows significant improvement over that from a single modality, since different modalities provide complementary information. However, some modalities are more informatively dominant than others. How to teach a model to learn embedding from different modalities without neglecting information from the less dominant modality is challenging. We present an image-text embedding model (ITEm), an unsupervised learning method that is designed to better attend to image and text modalities. We extend BERT by (1) learning an embedding from text and image without knowing the regions of interest; (2) training a global representation to predict masked words and to construct masked image patches without their individual representations. We evaluate the pre-trained ITEm on two tasks: the search for extremely similar products and the prediction of product categories, showing substantial gains compared to strong baseline models.
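
The abstract's first extension, learning from text and image "without knowing the regions of interest", amounts to choosing mask positions uniformly at random over title tokens and over a fixed grid of image patches. The following is a minimal sketch of that sampling step only, not the authors' implementation. The 224x224 image size, the 32-pixel patch size, the 15% mask ratio, and all function names are assumptions made for illustration; none of them are stated in this summary.

```python
import random

def patch_grid(height, width, patch):
    """Top-left (row, col) corners of non-overlapping patches on a fixed grid."""
    return [(r, c)
            for r in range(0, height - patch + 1, patch)
            for c in range(0, width - patch + 1, patch)]

def sample_mask(n_items, ratio, rng):
    """Pick positions to mask uniformly at random; no region-of-interest detector is involved."""
    k = max(1, round(n_items * ratio))  # always mask at least one position
    return sorted(rng.sample(range(n_items), k))

def mask_sequence(items, positions, mask_value):
    """Return a copy of `items` with the chosen positions replaced by `mask_value`."""
    masked = list(items)
    for p in positions:
        masked[p] = mask_value
    return masked

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical product title and image size (assumptions, not from the paper).
tokens = ["care", "bears", "plush", "toy", "13", "inch"]
patches = patch_grid(224, 224, 32)  # 7 x 7 grid of patch corners

MASK_RATIO = 0.15  # assumed BERT-style ratio
token_mask = sample_mask(len(tokens), MASK_RATIO, rng)
patch_mask = sample_mask(len(patches), MASK_RATIO, rng)

masked_tokens = mask_sequence(tokens, token_mask, "[MASK]")
print(masked_tokens)
print("masked patch corners:", [patches[i] for i in patch_mask])
```

Because positions are drawn uniformly, the same routine serves both modalities; an ROI-based pipeline would instead need an object detector to propose the image regions to mask.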

Paper Structure (16 sections, 5 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Examples from ITOP.
  • Figure 2: Index set distribution over meta category.
  • Figure 3: A search example of ITOP: we show a query on the left with its two true matches and two distractors on the right. Distractors are "hard" examples because they all come from the same leaf category as the query, i.e. "Care Bears", yet only the true matches share the same product model.
  • Figure 4: ITEm Architecture. ITEm is pre-trained with five objectives: image-text matching (ITM), masked image modeling (MIM), masked image modeling based on global information (GMIM), masked language modeling (MLM) and masked language modeling based on global information (GMLM). Image patches or tokens are randomly sampled to be masked without knowing regions of interest.
  • Figure 5: Top 5 examples for the same-product recommendation task. The product on the top left is the query. Products in the blue boxes are matching products, while the others are distractors.
  • ...and 4 more figures
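
The search setting described in the captions for Figures 3 and 5 (a query ranked against an index containing true matches and same-category distractors) can be sketched as a nearest-neighbour lookup over product embeddings. This is a toy illustration under stated assumptions: cosine similarity is assumed as the ranking score, the three-dimensional vectors and product IDs are invented, and `cosine` and `top_k` are hypothetical helper names rather than anything from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query, index, k):
    """Return the k product IDs whose embeddings are most similar to the query."""
    scored = sorted(index.items(),
                    key=lambda item: cosine(query, item[1]),
                    reverse=True)
    return [product_id for product_id, _ in scored[:k]]

# Invented embeddings: two true matches and two distractors from the same leaf category.
index = {
    "match_a": [0.9, 0.1, 0.0],
    "match_b": [0.8, 0.2, 0.1],
    "distractor_a": [0.1, 0.9, 0.2],
    "distractor_b": [0.0, 0.3, 0.9],
}
query = [1.0, 0.1, 0.0]

ranked = top_k(query, index, k=2)
print(ranked)
```

In this toy index the two matches rank ahead of the distractors. The captions call the distractors "hard" because they come from the same leaf category as the query, so a real embedding has to separate them on product-level detail rather than on category.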