Optimizing Product Deduplication in E-Commerce with Multimodal Embeddings

Aysenur Kulunk, Berk Taskin, M. Furkan Eseoglu, H. Bahadir Sahin

TL;DR

The paper tackles duplicate product listings in a large-scale Turkish e-commerce marketplace by building a domain-specific, multimodal deduplication system. It combines a BERTurk-based Turkish text encoder with a Masked AutoEncoder–based image encoder to produce compact 128-dimensional embeddings, and uses a dedicated decider model that fuses text and image vectors for fast, category-agnostic classification. Using Milvus vector search with IVF_FLAT indexing, the approach achieves a macro-F1 of 0.90, outperforming a third-party baseline (0.83), while keeping memory use and latency low enough for catalogs of hundreds of millions of items. The work demonstrates scalable, efficient, and accurate deduplication with potential for deployment at scale (116M product vectors daily) and outlines future directions for richer multimodal integration and language expansion.
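Since the headline result is a macro-averaged F1 score, a short illustration of how that metric is computed may be useful. The sketch below is not the authors' evaluation code; the function name `macro_f1` and the toy labels are hypothetical and chosen only for illustration. Macro-averaging computes F1 separately for each class and takes the unweighted mean, so a rare class counts as much as a frequent one.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        per_class.append(2 * tp / denom if denom else 0.0)
    return sum(per_class) / len(per_class)


# Toy labels (hypothetical): three duplicate pairs, one distinct pair.
y_true = ["dup", "dup", "dup", "distinct"]
y_pred = ["dup", "dup", "distinct", "distinct"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this toy case the "dup" class has F1 = 0.8 and the "distinct" class has F1 = 2/3, so the macro average is their plain mean regardless of how many examples each class has.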

Abstract

In large-scale e-commerce marketplaces, duplicate product listings frequently cause consumer confusion and operational inefficiencies, degrading trust in the platform and increasing costs. Traditional keyword-based search methods struggle to identify duplicates accurately because they rely on exact textual matches and neglect the semantic similarities inherent in product titles. To address these challenges, we introduce a scalable, multimodal product deduplication system designed specifically for the e-commerce domain. Our approach employs a domain-specific text model based on the BERT architecture in conjunction with Masked AutoEncoders for image representations. Both architectures are augmented with dimensionality reduction techniques to produce compact 128-dimensional embeddings without significant information loss. Complementing this, we also developed a novel decider model that leverages both text and image vectors. By integrating these feature extraction mechanisms with Milvus, an optimized vector database, our system supports efficient, high-precision similarity searches across extensive product catalogs exceeding 200 million items while consuming just 100GB of system RAM. Empirical evaluations demonstrate that our matching system achieves a macro-average F1 score of 0.90, outperforming third-party solutions, which attain an F1 score of 0.83. Our findings show the potential of combining domain-specific adaptations with state-of-the-art machine learning techniques to mitigate duplicate listings in large-scale e-commerce environments.
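As a rough sanity check on the memory figure, the following back-of-the-envelope sketch estimates the raw storage for 200 million 128-dimensional vectors. It assumes 32-bit floats and one vector per item, neither of which is stated in the abstract, and it ignores index overhead (for example, IVF_FLAT centroids and IDs). The helper name `raw_vector_bytes` is hypothetical.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors; bytes_per_value=4 assumes float32."""
    return num_vectors * dim * bytes_per_value


# Figures from the abstract: 200 million items, 128-dimensional embeddings.
total = raw_vector_bytes(200_000_000, 128)
print(f"{total / 1e9:.1f} GB")     # decimal gigabytes
print(f"{total / 2**30:.1f} GiB")  # binary gibibytes
```

Under these assumptions the raw vectors come to about 102.4 GB (roughly 95.4 GiB), which is of the same order as the 100GB reported. The same arithmetic shows why the reduction to 128 dimensions matters: a 768-dimensional float32 embedding, the usual output size of a BERT-base model, would need six times as much memory per vector.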

Paper Structure

This paper contains 26 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Updated version of text model
  • Figure 2: (a) Second iteration of the image model; P# denotes the selected patches and M# denotes the masked patches. (b) Final iteration of the image model; P# denotes the selected patches, C is the center patch, and Xr is the resized image.
  • Figure 3: Final system design for product deduplication.
  • Figure 4: Scale differences of a product. (a) Image without a white border; (b) image with a white border.