End-to-end multi-modal product matching in fashion e-commerce

Sándor Tóth, Stephen Wilson, Alexia Tsoukara, Enric Moreu, Anton Masalovich, Lars Roemheld

TL;DR

This work tackles end-to-end, multi-modal product matching in fashion e-commerce under cross-domain distribution shifts. It proposes fashionID, a two-stage system that encodes offers from image, text, and numerical features and retrieves matches via nearest-neighbor search in a learned embedding space, trained with large-batch contrastive learning. CLIP-based encoders outperform the DINO and offerDNA baselines; image signals drive most of the performance, and numerical features add modest gains. A linear projection trained with large batches generalizes well and is efficient in production. The authors also demonstrate a practical human-in-the-loop (HITL) workflow that substantially improves precision in production, offering a scalable, cost-aware path to industry-ready multi-modal matching for fashion catalogs.

Abstract

Product matching, the task of identifying different representations of the same product for better discoverability, curation, and pricing, is a key capability for online marketplace and e-commerce companies. We present a robust multi-modal product matching system in an industry setting, where large datasets, data distribution shifts, and unseen domains pose challenges. We compare different approaches and conclude that a relatively straightforward projection of pretrained image and text encoders, trained through contrastive learning, yields state-of-the-art results while balancing cost and performance. Our solution outperforms single-modality matching systems and large pretrained models, such as CLIP. Furthermore, we show how a human-in-the-loop process can be combined with model-based predictions to achieve near-perfect precision in a production system.
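
The "projection of pretrained encoders, trained through contrastive learning" can be illustrated with a minimal sketch. It assumes an InfoNCE-style in-batch loss; the abstract says only "contrastive learning", so the exact loss, the temperature value, and the helper names `project` and `info_nce_loss` are illustrative assumptions, not the authors' implementation. The input vectors stand in for features from a frozen pretrained encoder.

```python
import math

def project(W, x):
    """Apply a linear projection W (one row per output dimension) to a
    frozen-encoder feature vector x, then L2-normalize the result."""
    y = [sum(w * v for w, v in zip(row, x)) for row in W]
    norm = math.sqrt(sum(v * v for v in y)) or 1.0
    return [v / norm for v in y]

def info_nce_loss(anchors, positives, temperature=0.1):
    """Assumed InfoNCE-style loss: anchors[i] should match positives[i];
    every other positive in the batch serves as a negative."""
    total = 0.0
    for i, a in enumerate(anchors):
        # Dot product equals cosine similarity for unit-length vectors.
        logits = [sum(p * q for p, q in zip(a, pos)) / temperature
                  for pos in positives]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)

# Toy batch: two offers of product A and two offers of product B.
W = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection (identity here)
anchors = [project(W, [3.0, 0.0]), project(W, [0.0, 2.0])]
positives = [project(W, [5.0, 0.0]), project(W, [0.0, 4.0])]
aligned_loss = info_nce_loss(anchors, positives)
shuffled_loss = info_nce_loss(anchors, positives[::-1])
```

In this toy batch the correctly paired embeddings give a much lower loss than the shuffled pairing, which is the signal that training the projection would exploit. A larger batch supplies more in-batch negatives per anchor, which is one plausible reading of why mini-batch size matters in this setup.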

Paper Structure

This paper contains 14 sections, 5 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: The setup of the multi-modal fashionID encoder.
  • Figure 2: Architecture of the retrieval system.
  • Figure 3: Precision-recall curves for fashionID evaluated on the most common fashion categories of the in-domain test set. Black dots denote the precision-recall values at the fixed similarity score of 0.80 per category.
  • Figure 4: Comparison of matching performance as a function of pretrained model parameter count. We show the parameter count of (frozen) CLIP encoders with and without linear projection and of the small, fully fine-tuned offerDNA; "in" / "out" denote results on the in-domain and out-domain test sets, respectively.
  • Figure 5: Performance of the CLIP (ViT-bigG-14) image encoder with linear projection as a function of training mini-batch size.
  • ...and 2 more figures
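
The fixed similarity score of 0.80 in the Figure 3 caption can be read as a decision threshold on candidate pairs. The sketch below shows how one precision-recall point would be computed at such a threshold. The function name, the toy scores, and the convention of returning 1.0 when a denominator is zero are assumptions for illustration; only the 0.80 value comes from the caption.

```python
def precision_recall_at_threshold(scored_pairs, threshold=0.80):
    """scored_pairs: iterable of (similarity, is_true_match) tuples.
    A pair is predicted as a match when similarity >= threshold."""
    tp = fp = fn = 0
    for similarity, is_true_match in scored_pairs:
        predicted_match = similarity >= threshold
        if predicted_match and is_true_match:
            tp += 1
        elif predicted_match:
            fp += 1
        elif is_true_match:
            fn += 1
    # Assumed convention: report 1.0 when there is nothing to be wrong about.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall

# Hypothetical candidate pairs for one category: (similarity, is_true_match).
pairs = [(0.95, True), (0.85, False), (0.82, True), (0.70, True), (0.40, False)]
precision, recall = precision_recall_at_threshold(pairs)
```

Sweeping `threshold` from 0 to 1 and collecting the resulting points would trace a curve of the kind the caption describes, with the 0.80 setting marking a single operating point on it.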