Adapting Vision-Language Models for E-commerce Understanding at Scale

Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi

TL;DR

This work tackles the gap between general vision-language models and the specialized, noisy, and multi-image data found in e-commerce. It presents a backbone-agnostic adaptation recipe that combines internal data curation, staged visual-language alignment, and visual instruction tuning to create e-commerce‑aware VLMs without sacrificing cross-domain capabilities. A four-part benchmark suite (Aspect Prediction, Deep Fashion Understanding, Dynamic Attribute Extraction, Multi-image Item Intelligence) evaluates in-domain performance across attribute extraction, fashion understanding, and regulatory-compliance tasks, while ablations reveal the importance of vision encoders, text decoders, and multi-image strategies. The results demonstrate substantial in-domain gains and efficient inference through cropping-based labeling and fine-tuning, with practical implications for production-ready, scalable e-commerce multimodal assistants. Overall, the paper provides a reproducible methodology and evaluation framework enabling robust e-commerce VLM deployment at scale, preserving broad multimodal capabilities.

Abstract

E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no documented, well-known strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.

Paper Structure (50 sections, 6 figures, 5 tables)

Figures (6)

  • Figure 1: Output of our E-commerce Adapted VLMs compared against the same-size LLaVA-OneVision. We show our models' ability to extract attributes from e-commerce items more faithfully. In red, we highlight wrong model predictions that are neither tied to the image nor valid item attributes.
  • Figure 2: Visual Verification Pipeline. The figure shows the pipeline we use to create the 4M e-commerce visual instruction tuning data. We begin by collecting raw listing data from the web (left). We then clean and pre-process the textual entries. In parallel, we create detailed captions for the corresponding images with InternVL-2.5-26B. Finally, we provide the captions together with the cleaned listings to Mistral-Small-3-24B to obtain the verified instructions, which are used, along with the original images, to train our models (shown with fire).
  • Figure 3: eBay Single-Image Visual Instruction Tuning Set. We break down the components of our internal single-image instruction tuning set. The pie chart on the left shows the percentages of tasks in our set. On the right, we break down each task into its subtasks, with the total number of instructions in parentheses.
  • Figure 4: Benchmark examples from Aspect Prediction and Deep Fashion Understanding. We choose a representative example from our Aspect Prediction and Deep Fashion Understanding benchmarks to showcase the tasks in detail.
  • Figure 5: Benchmark example from Dynamic Attribute Extraction. We choose a representative example from our Dynamic Attribute Extraction benchmark to showcase the task in detail.
  • ...and 1 more figure
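
The data flow described in the Figure 2 caption (clean the listing text, caption the image, then verify the two against each other) can be sketched roughly as below. This is only an illustrative outline, not the authors' implementation: the `Listing` type, `clean_text`, `build_instruction`, and the `captioner` and `verifier` callables are all hypothetical names. In the paper, captioning is done with InternVL-2.5-26B and verification with Mistral-Small-3-24B; here they are passed in as plain functions so the sketch stays self-contained, and the regex-based cleaning is an assumed stand-in for the real pre-processing.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Listing:
    """Hypothetical container for one raw web listing."""
    image_path: str
    title: str
    description: str


def clean_text(text: str) -> str:
    """Assumed cleaning step: drop HTML-like tags and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()


def build_instruction(
    listing: Listing,
    captioner: Callable[[str], str],      # stand-in for the image-captioning VLM
    verifier: Callable[[str, str], str],  # stand-in for the verifying LLM
) -> Optional[dict]:
    """Return one training example, or None if verification yields nothing."""
    cleaned = clean_text(f"{listing.title}. {listing.description}")
    caption = captioner(listing.image_path)  # independent of the text cleaning
    verified = verifier(caption, cleaned)    # cross-check caption against text
    if not verified:
        return None
    # The original image is kept alongside the verified instruction.
    return {"image": listing.image_path, "instruction": verified}
```

Passing the two models in as callables keeps the sketch independent of any particular serving stack; in practice each would wrap a call to the corresponding model.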