
Text-Based Product Matching -- Semi-Supervised Clustering Approach

Alicja Martinek, Szymon Łukasik, Amir H. Gandomi

TL;DR

This work tackles product matching across feeds by reframing the task as a semi-supervised constrained clustering problem solved with Improved Deep Embedded Clustering (IDEC). It constructs a 5-feature textual similarity vector from fuzzy string metrics and distances, and enforces Must Link and Can't Link constraints to guide clustering toward $2$ clusters, one for matching and one for non-matching pairs. On the Skroutz dataset and additional camera-related corpora, IDEC with carefully chosen constraints consistently outperforms baselines such as k-means, XGBoost, and DeepMatcher, reaching a peak $F_1$ score of about 0.917 and a high Rand Index, which indicates strong pairwise agreement with the ground truth. The results suggest that semi-supervised, text-centric product matching can reduce labeling requirements while remaining robust on real-world, noisy product data, and that the approach could be extended to other constrained clustering algorithms and richer feature sets.
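To make the idea of a 5-feature textual similarity vector concrete, here is a minimal sketch that maps a pair of product titles to five fuzzy-string features. The five features chosen here (sequence ratio, token-sort ratio, token Jaccard overlap, normalized edit-distance similarity, length similarity) are assumptions for illustration only; this summary does not list the exact metrics the authors use, and the function names `levenshtein` and `similarity_vector` are hypothetical.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic two-row dynamic programme."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_vector(title_a: str, title_b: str) -> list[float]:
    """Hypothetical 5-feature vector; every feature lies in [0, 1]."""
    a, b = title_a.lower().strip(), title_b.lower().strip()
    tokens_a, tokens_b = a.split(), b.split()
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    longest = max(len(a), len(b)) or 1  # avoid division by zero
    return [
        # 1. character-level similarity of the raw titles
        SequenceMatcher(None, a, b).ratio(),
        # 2. same ratio after sorting tokens (ignores word order)
        SequenceMatcher(None, " ".join(sorted(tokens_a)),
                        " ".join(sorted(tokens_b))).ratio(),
        # 3. Jaccard overlap of the token sets
        len(set_a & set_b) / len(union) if union else 1.0,
        # 4. edit distance rescaled to a similarity
        1.0 - levenshtein(a, b) / longest,
        # 5. how close the two title lengths are
        1.0 - abs(len(a) - len(b)) / longest,
    ]


if __name__ == "__main__":
    print(similarity_vector("Canon EOS 80D body", "EOS 80D Canon (body only)"))
```

Each pair of offers becomes one such vector, and it is these vectors, not the raw titles, that a clustering algorithm would then split into a matching and a non-matching group.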

Abstract

Matching identical products present in multiple product feeds constitutes a crucial element of many e-commerce tasks, such as comparing product offerings, dynamic price optimization, and selecting the assortment personalized for the client. It corresponds to the well-known machine learning task of entity matching, with its own specificity, like omnipresent unstructured data or inaccurate and inconsistent product descriptions. This paper aims to present a new approach to product matching utilizing semi-supervised clustering. We study the properties of this method by experimenting with the IDEC algorithm on a real-world dataset using predominantly textual features and fuzzy string matching, with more standard approaches as a point of reference. Encouraging results show that unsupervised matching, enriched with a small annotated sample of product links, could be a possible alternative to the dominant supervised strategy, which requires extensive manual data labeling.
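One way to picture how a small annotated sample can steer an otherwise unsupervised clustering is to turn the labels into pairwise constraints. The sketch below assumes that each clustered sample is a pair of offers labeled 1 (matching) or 0 (non-matching), that two samples with the same label yield a Must Link, and that two samples with different labels yield a Can't Link. This is the standard way to derive constraints from labels and is an assumption here, not a description of the authors' exact procedure; `constraints_from_labels` is a hypothetical helper.

```python
from itertools import combinations


def constraints_from_labels(labels: dict[int, int]):
    """Derive pairwise constraints from a few annotated samples.

    labels maps a sample index to 1 (matching) or 0 (non-matching).
    Returns (must_link, cannot_link), each a list of index pairs (i, j), i < j.
    """
    must_link, cannot_link = [], []
    for (i, label_i), (j, label_j) in combinations(sorted(labels.items()), 2):
        if label_i == label_j:
            must_link.append((i, j))    # same class: keep in one cluster
        else:
            cannot_link.append((i, j))  # different classes: keep apart
    return must_link, cannot_link


if __name__ == "__main__":
    # Only three of the samples are annotated; the rest stay unlabeled.
    annotated = {0: 1, 3: 1, 5: 0}
    ml, cl = constraints_from_labels(annotated)
    print("Must Link:", ml)
    print("Can't Link:", cl)
```

The number of constraints grows quadratically with the number of annotated samples, so in practice one would likely subsample them before passing them to a constrained clustering algorithm.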

Paper Structure

This paper contains 13 sections, 4 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Example of Must Link (solid line) and Can't Link (dashed line) constraints
  • Figure 2: Impact of increasing the number of Must Link constraints
  • Figure 3: Impact of increasing the number of Can't Link constraints
  • Figure 4: Impact of increasing the number of 1-1 pairs in Must Link constraints
  • Figure 5: Impact of increasing the number of matching pairs in the dataset