Text-Based Product Matching -- Semi-Supervised Clustering Approach

Alicja Martinek; Szymon Łukasik; Amir H. Gandomi

TL;DR

This work reframes product matching across feeds as a semi-supervised constrained clustering problem solved with Improved Deep Embedded Clustering (IDEC). Each candidate product pair is described by a 5-feature textual similarity vector built from fuzzy string metrics and distances, and Must Link and Can't Link constraints guide the clustering toward two clusters: matching and non-matching pairs. On the Skroutz dataset and additional camera-related corpora, IDEC with carefully chosen constraints consistently outperforms baselines such as k-means, XGBoost, and DeepMatcher, reaching a peak $F_1$ score of about 0.917 and a high Rand Index, which indicates strong pairwise agreement with the ground truth. The results suggest that semi-supervised, text-centric product matching can reduce labeling requirements while remaining robust on noisy real-world product data, and that the approach could be extended to other constrained clustering algorithms and richer feature sets.
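To make the idea of a pairwise textual similarity vector concrete, the following is a minimal sketch using only the Python standard library. The five features chosen here (character ratio, token-sorted ratio, token Jaccard, normalized edit similarity, length similarity) are illustrative assumptions, not the feature set used in the paper, and the names `levenshtein` and `similarity_vector` are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance via row-by-row dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity_vector(title_a: str, title_b: str) -> List[float]:
    """Return five similarity scores in [0, 1] for a pair of product titles.

    The choice of features is an assumption made for illustration only.
    """
    a, b = title_a.lower().strip(), title_b.lower().strip()
    tokens_a, tokens_b = set(a.split()), set(b.split())
    union = tokens_a | tokens_b
    longest = max(len(a), len(b)) or 1  # avoid division by zero
    sorted_a = " ".join(sorted(tokens_a))
    sorted_b = " ".join(sorted(tokens_b))
    return [
        SequenceMatcher(None, a, b).ratio(),                      # character-level ratio
        SequenceMatcher(None, sorted_a, sorted_b).ratio(),        # order-insensitive ratio
        len(tokens_a & tokens_b) / len(union) if union else 1.0,  # token Jaccard
        1.0 - levenshtein(a, b) / longest,                        # normalized edit similarity
        1.0 - abs(len(a) - len(b)) / longest,                     # length similarity
    ]
```

Under this sketch, each candidate pair of offers becomes one 5-dimensional point. A clustering algorithm would then separate these points into matching and non-matching groups, with a small set of labeled pairs supplying the Must Link and Can't Link constraints.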

Abstract

Matching identical products present in multiple product feeds is a crucial element of many e-commerce tasks, such as comparing product offerings, dynamic price optimization, and selecting an assortment personalized for the client. It corresponds to the well-known machine learning task of entity matching, with its own specifics, such as omnipresent unstructured data and inaccurate or inconsistent product descriptions. This paper presents a new approach to product matching based on semi-supervised clustering. We study the properties of this method by experimenting with the IDEC algorithm on a real-world dataset, using predominantly textual features and fuzzy string matching, with more standard approaches as a point of reference. Encouraging results show that unsupervised matching, enriched with a small annotated sample of product links, could be an alternative to the dominant supervised strategy, which requires extensive manual data labeling.

Paper Structure (13 sections, 4 equations, 5 figures, 3 tables)

Introduction
Related Work
Product matching
Transforming textual data into numerical features
Clustering
Evaluation metrics
Proposed Algorithm
Experimental Settings and Results
Constraints impact
Comparison with other methods
Other datasets
Class distribution impact
Conclusion

Figures (5)

Figure 1: Example of Must Link (solid line) and Can't Link (dashed line) constraints
Figure 2: Impact of increasing the amount of Must Link Constraints
Figure 3: Impact of increasing the amount of Can't Link Constraints
Figure 4: Impact of increasing the amount of 1-1 pairs in Must Link Constraints
Figure 5: Impact of increasing the amount of matching pairs in the dataset
