Text-Based Product Matching -- Semi-Supervised Clustering Approach
Alicja Martinek, Szymon Łukasik, Amir H. Gandomi
TL;DR
This work tackles product matching across feeds by reframing the task as a semi-supervised constrained clustering problem solved with Improved Deep Embedded Clustering (IDEC). It constructs a 5-feature textual similarity vector from fuzzy string metrics and distances, and enforces Must Link and Cannot Link constraints to guide clustering toward two clusters, one for matching and one for non-matching product pairs. Across the Skroutz dataset and additional camera-related corpora, IDEC with carefully chosen constraints consistently outperforms baselines such as k-means, XGBoost, and DeepMatcher, achieving a peak $F_1$ score of around 0.917 and a high Rand Index, which indicates strong pairwise agreement with the ground truth. The results suggest that semi-supervised, text-centric product matching can reduce labeling requirements while delivering robust performance on real-world, noisy product data, with potential for broader deployment and for extensions to other constrained clustering algorithms and richer feature sets.
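To make the idea of a textual similarity vector concrete, the following is a minimal sketch of how a pair of product titles could be turned into five similarity scores. The five features chosen here (character ratio, token-sort ratio, token Jaccard, normalised edit similarity, length similarity) and the function names are illustrative assumptions, not the feature set used by the authors; only the Python standard library is used.

```python
from difflib import SequenceMatcher


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the classic dynamic-programming recurrence."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_vector(title_a: str, title_b: str) -> list[float]:
    """Hypothetical 5-feature vector for one product pair; each value lies in [0, 1]."""
    a, b = title_a.lower().strip(), title_b.lower().strip()
    tokens_a, tokens_b = set(a.split()), set(b.split())
    sorted_a = " ".join(sorted(a.split()))
    sorted_b = " ".join(sorted(b.split()))
    longest = max(len(a), len(b)) or 1  # avoid division by zero for empty titles
    union = tokens_a | tokens_b
    return [
        SequenceMatcher(None, a, b).ratio(),                # character-level ratio
        SequenceMatcher(None, sorted_a, sorted_b).ratio(),  # order-insensitive ratio
        len(tokens_a & tokens_b) / len(union) if union else 1.0,  # token Jaccard
        1.0 - levenshtein(a, b) / longest,                  # normalised edit similarity
        1.0 - abs(len(a) - len(b)) / longest,               # length similarity
    ]


if __name__ == "__main__":
    # Made-up titles, for illustration only.
    print(similarity_vector("Canon EOS 250D body black", "EOS 250D Canon (Black) Body"))
```

Each product pair then becomes one point in a 5-dimensional space, and the clustering step only has to separate those points into a matching and a non-matching group.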
Abstract
Matching identical products present in multiple product feeds is a crucial element of many e-commerce tasks, such as comparing product offerings, dynamic price optimization, and selecting an assortment personalized for the client. It corresponds to the well-known machine learning task of entity matching, with its own specific challenges, such as omnipresent unstructured data and inaccurate or inconsistent product descriptions. This paper presents a new approach to product matching based on semi-supervised clustering. We study the properties of this method by experimenting with the IDEC algorithm on a real-world dataset using predominantly textual features and fuzzy string matching, with more standard approaches as a point of reference. Encouraging results show that unsupervised matching, enriched with a small annotated sample of product links, could be an alternative to the dominant supervised strategy, which requires extensive manual data labeling.
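The role of the small annotated sample can be pictured as follows: labeled items are converted into pairwise Must Link and Cannot Link constraints, and a candidate cluster assignment is judged by how many of them it breaks. The sketch below is a generic illustration of that idea under stated assumptions; the helper names `build_constraints` and `count_violations` are hypothetical, and this is not the IDEC loss or the authors' constraint-selection procedure.

```python
from itertools import combinations


def build_constraints(labels: dict[int, int]):
    """Turn a few labeled items (index -> label) into pairwise constraints.

    Same label -> Must Link, different labels -> Cannot Link.
    """
    must_link, cannot_link = [], []
    for (i, yi), (j, yj) in combinations(sorted(labels.items()), 2):
        (must_link if yi == yj else cannot_link).append((i, j))
    return must_link, cannot_link


def count_violations(assignment, must_link, cannot_link) -> int:
    """Count constraints broken by a cluster assignment (one cluster id per item)."""
    broken_ml = sum(assignment[i] != assignment[j] for i, j in must_link)
    broken_cl = sum(assignment[i] == assignment[j] for i, j in cannot_link)
    return broken_ml + broken_cl


if __name__ == "__main__":
    # Toy example: 6 items, of which only items 0, 3 and 5 are annotated.
    labeled = {0: 1, 3: 1, 5: 0}  # 1 = match, 0 = non-match (made-up labels)
    ml, cl = build_constraints(labeled)
    print(ml, cl)
    print(count_violations([1, 0, 0, 1, 1, 0], ml, cl))
    print(count_violations([1, 0, 0, 0, 1, 0], ml, cl))
```

Because the constraints only state whether two items share a cluster, they stay valid if the cluster ids are swapped, which is why they fit a clustering formulation better than direct class labels would.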
