Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the-Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs

Urszula Czerwinska, Cenk Bircanoglu, Jeremy Chamoux

TL;DR

This work systematically benchmarks image embeddings for e-Commerce across six diverse datasets, evaluating supervised, self-supervised, and contrastive text-image pretraining under full fine-tuning, top-tuning, and cross-tuning. It finds that while full fine-tuning generally yields the best performance, top-tuning—especially for SSL and text-image models—nearly matches or exceeds it at a fraction of the cost, with cross-tuning showing dataset-dependent benefits. The study provides practical guidelines balancing accuracy and efficiency for embedding selection and adaptation in industry settings, highlighting the strong retrieval performance of contrastive text-image models in pure image-to-image tasks. These insights enable more informed deployment of foundation-model embeddings in real-world e-Commerce systems, with implications for rapid prototyping and scalable fine-tuning strategies.

Abstract

We benchmark foundation-model image embeddings for classification and retrieval in e-Commerce, evaluating their suitability for real-world applications. Our study spans embeddings from pre-trained convolutional and transformer models trained via supervised, self-supervised (SSL), and text-image contrastive learning. We assess full fine-tuning and transfer learning (top-tuning) on six diverse e-Commerce datasets spanning fashion, consumer goods, cars, food, and retail. Results show that full fine-tuning consistently performs well, while text-image and self-supervised embeddings can match its performance with less training. Supervised embeddings remain stable across architectures, whereas SSL and contrastive embeddings vary significantly and often benefit from top-tuning. Top-tuning emerges as an efficient alternative to full fine-tuning, reducing computational costs. We also explore cross-tuning and note that its impact depends on dataset characteristics. Our findings offer practical guidelines for embedding selection and fine-tuning strategies, balancing efficiency and performance.
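To make the top-tuning idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes top-tuning means training only a small classifier head on embeddings from a frozen backbone. The embeddings here are synthetic stand-ins, and the linear softmax head, the function names (`train_top_head`, `predict`), and all hyperparameters are illustrative assumptions.

```python
import numpy as np

def train_top_head(embeddings, labels, num_classes, lr=0.01, epochs=300, seed=0):
    """Fit a linear softmax head on fixed embeddings with gradient descent.

    The backbone is never touched: only W and b are updated (assumed setup).
    """
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = embeddings @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n  # gradient of mean cross-entropy w.r.t. logits
        W -= lr * embeddings.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(embeddings, W, b):
    """Return the predicted class index for each embedding."""
    return (embeddings @ W + b).argmax(axis=1)

# Synthetic stand-in for embeddings produced by a frozen pre-trained model.
rng = np.random.default_rng(42)
num_classes, dim, per_class = 3, 16, 50
centers = rng.normal(scale=3.0, size=(num_classes, dim))
labels = np.repeat(np.arange(num_classes), per_class)
embeddings = centers[labels] + rng.normal(size=(labels.size, dim))

W, b = train_top_head(embeddings, labels, num_classes)
accuracy = (predict(embeddings, W, b) == labels).mean()
print(f"train accuracy: {accuracy:.2f}")
```

Because the embeddings are computed once and only the head is trained, each epoch costs a few matrix products, which is the source of the efficiency gain over updating the full backbone.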

Paper Structure

This paper contains 39 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: A high-level illustration of our experimental workflow. We evaluate pre-trained, fine-tuned, and top-tuned models on six e-Commerce datasets, assessing performance through retrieval. Additionally, pre-trained models undergo classification testing via top-tuning. All metrics are logged in an MLflow dashboard.
  • Figure 2: Models used in this study, showing the relationship between embedding size, FLOPs (B), and parameters (M). DINO, DINOv2, MAWS, MAE, and CLIP share a ViT-B architecture and are represented alongside the vanilla ViT.
  • Figure 3: Retrieval metric correlation. All retrieval metrics used in this study are highly correlated.
  • Figure 4: Results for all models. Retrieval performance comparison for all model types (mMP@5) (a), best models of each type hierarchy per dataset (b), and z-score normalized performance vs. embedding size (c).
  • Figure 5: Detailed results of all pretrained models' performance on each dataset.
  • ...and 2 more figures