Adapting Vision-Language Models for E-commerce Understanding at Scale
Matteo Nulli, Vladimir Orshulevich, Tala Bazazo, Christian Herold, Michael Kozielski, Marcin Mazur, Szymon Tuzel, Cees G. M. Snoek, Seyyed Hadi Hashemi, Omar Javed, Yannick Versley, Shahram Khadivi
TL;DR
This work tackles the gap between general vision-language models and the specialized, noisy, and multi-image data found in e-commerce. It presents a backbone-agnostic adaptation recipe that combines internal data curation, staged visual-language alignment, and visual instruction tuning to create e-commerce-aware VLMs without sacrificing cross-domain capabilities. A four-part benchmark suite (Aspect Prediction, Deep Fashion Understanding, Dynamic Attribute Extraction, Multi-image Item Intelligence) evaluates in-domain performance across attribute extraction, fashion understanding, and regulatory-compliance tasks, while ablations reveal the importance of vision encoders, text decoders, and multi-image strategies. The results demonstrate substantial in-domain gains and efficient inference through cropping-based labeling and fine-tuning, with practical implications for production-ready, scalable e-commerce multimodal assistants. Overall, the paper provides a reproducible methodology and evaluation framework for deploying e-commerce VLMs at scale while preserving broad multimodal capabilities.
Abstract
E-commerce product understanding by nature demands strong multimodal comprehension of text, images, and structured attributes. General-purpose Vision-Language Models (VLMs) enable generalizable multimodal latent modelling, yet there is no well-documented strategy for adapting them to the attribute-centric, multi-image, and noisy nature of e-commerce data without sacrificing general performance. In this work, we show through a large-scale experimental study how targeted adaptation of general VLMs can substantially improve e-commerce performance while preserving broad multimodal capabilities. Furthermore, we propose a novel, extensive evaluation suite covering deep product understanding, strict instruction following, and dynamic attribute extraction.
