Cross-Platform E-Commerce Product Categorization and Recategorization: A Multimodal Hierarchical Classification Approach

Lotte Gross, Rebecca Walter, Nicole Zoppi, Adrien Justus, Alessandro Gambetti, Qiwei Han, Maximilian Kaiser

TL;DR

The paper tackles cross-platform e-commerce product categorization by developing a multimodal hierarchical framework that fuses textual, visual, and vision-language signals, coupled with dynamic masking to maintain taxonomic validity. It demonstrates that CLIP-based late fusion delivers the strongest hierarchical performance while a two-stage deployment (RoBERTa followed by a GPU-accelerated multimodal stage) balances accuracy and cost for industrial use. A self-supervised recategorization pipeline using SimCLR, UMAP, and cascade clustering discovers fine-grained subcategories (e.g., within Shoes) and generalizes across platforms, reducing manual taxonomy maintenance. The work confirms the practicality of deploying scalable, robust, cross-platform categorization pipelines in production environments (EURWEB) and outlines a path for taxonomy evolution aligned with dynamic market trends.

Abstract

This study addresses critical industrial challenges in e-commerce product categorization, namely platform heterogeneity and the structural limitations of existing taxonomies, by developing and deploying a multimodal hierarchical classification framework. Using a dataset of 271,700 products from 40 international fashion e-commerce platforms, we integrate textual features (RoBERTa), visual features (ViT), and joint vision-language representations (CLIP). We investigate fusion strategies, including early, late, and attention-based fusion within a hierarchical architecture enhanced by dynamic masking to ensure taxonomic consistency. Results show that CLIP embeddings combined via an MLP-based late-fusion strategy achieve the highest hierarchical F1 (98.59%), outperforming unimodal baselines. To address shallow or inconsistent categories, we further introduce a self-supervised "product recategorization" pipeline using SimCLR, UMAP, and cascade clustering, which discovered new, fine-grained categories (for example, subtypes of "Shoes") with cluster purities above 86%. Cross-platform experiments reveal a deployment-relevant trade-off: complex late-fusion methods maximize accuracy with diverse training data, while simpler early-fusion methods generalize more effectively to unseen platforms. Finally, we demonstrate the framework's industrial scalability through deployment in EURWEB's commercial transaction intelligence platform via a two-stage inference pipeline, combining a lightweight RoBERTa stage with a GPU-accelerated multimodal stage to balance cost and accuracy.
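
The abstract reports cluster purities above 86% for the discovered subcategories. As a rough illustration of what such a number measures, the sketch below computes per-cluster purity in the standard way: the share of items in a cluster that carry the cluster's most common reference label. The function name `cluster_purity`, the toy cluster assignments, and the use of hand-assigned reference labels are assumptions made for this example; the paper's exact evaluation protocol is not described in this summary.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, reference_labels):
    """Return {cluster_id: purity}, where purity is the fraction of items
    in the cluster that share the cluster's most frequent reference label.

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(cluster_ids) != len(reference_labels):
        raise ValueError("cluster_ids and reference_labels must have equal length")
    if not cluster_ids:
        raise ValueError("at least one item is required")

    # Count how often each reference label occurs inside each cluster.
    label_counts = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, reference_labels):
        label_counts[cluster_id][label] += 1

    # Purity = size of the majority label group / size of the cluster.
    return {
        cluster_id: max(counts.values()) / sum(counts.values())
        for cluster_id, counts in label_counts.items()
    }


# Toy data (assumed): seven "Shoes" products split into two clusters.
clusters = [0, 0, 0, 0, 1, 1, 1]
labels = [
    "Sneakers", "Sneakers", "Sneakers", "Sport Shoes",
    "Open Shoes", "Open Shoes", "Open Shoes",
]
purities = cluster_purity(clusters, labels)
print(purities)
```

In this toy case cluster 0 contains one "Sport Shoes" item among three "Sneakers", so its purity is 0.75, while cluster 1 is perfectly pure. A purity above 0.86 would mean that fewer than 14% of a cluster's items fall outside its dominant label.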

Paper Structure

This paper contains 20 sections, 3 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Overview of the Google Product Taxonomy structure, showing imbalanced hierarchical levels for Clothing (5 levels) and Shoes (2 levels).
  • Figure 2: Overview of the proposed methodology, illustrating the pipeline from cross-platform data input through multimodal feature extraction (RoBERTa, ViT, CLIP), multiple fusion strategies, and hierarchical classification, alongside the parallel process for product recategorization using SimCLR and cascade clustering.
  • Figure 3: Hierarchical model architecture. A shared layer processes the multimodal embedding. Subsequent layers combine shared representations with coarser predictions before making finer predictions.
  • Figure 4: Dynamic masking mechanism applied between hierarchical prediction layers. Each coarser-level prediction constrains the candidate set at the next level, ensuring valid taxonomy paths.
  • Figure 5: Illustrative example of recategorization for "Open Shoes," "Sneakers," and "Sport Shoes" derived from the broader "Shoes" category. This process demonstrates how the pipeline refines shallow taxonomies into more granular, industrially useful categories.
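
The dynamic masking idea in the Figure 4 caption can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-level taxonomy, the category names in `PARENTS` and `CHILDREN`, the helper `masked_child_probs`, and the choice to mask using the argmax of the coarser level are all hypothetical. The sketch sets the logits of children that do not belong to the predicted parent to negative infinity, so the softmax assigns them zero probability and only valid taxonomy paths remain.

```python
import math

# Hypothetical two-level taxonomy used only for this example.
PARENTS = ["Clothing", "Shoes"]
CHILDREN = ["Dresses", "Shirts", "Sneakers", "Sandals"]
CHILD_TO_PARENT = {
    "Dresses": "Clothing",
    "Shirts": "Clothing",
    "Sneakers": "Shoes",
    "Sandals": "Shoes",
}


def softmax(logits):
    """Numerically stable softmax; entries equal to -inf get probability 0."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_child_probs(parent_logits, child_logits):
    """Pick the most likely parent, then mask out children of other parents."""
    parent_idx = max(range(len(parent_logits)), key=lambda i: parent_logits[i])
    parent = PARENTS[parent_idx]
    masked = [
        logit if CHILD_TO_PARENT[child] == parent else -math.inf
        for child, logit in zip(CHILDREN, child_logits)
    ]
    return parent, softmax(masked)


# The raw child logits favor "Dresses", which is invalid under "Shoes".
parent, probs = masked_child_probs([0.2, 1.5], [3.0, 0.1, 1.0, 0.5])
best_child = CHILDREN[probs.index(max(probs))]
print(parent, best_child, probs)
```

Without the mask, the finer level would predict "Dresses" under a "Shoes" parent, which is not a valid path. With the mask, the prediction falls back to the best child of the selected parent. A soft variant that weights children by the full parent distribution is also conceivable; which form the paper uses is not stated in this summary.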