Captions Speak Louder than Images: Generalizing Foundation Models for E-commerce from High-quality Multimodal Instruction Data

Xinyi Ling; Hanwen Du; Bo Peng; Zhihui Zhu; Xia Ning

TL;DR

The paper addresses two gaps in e-commerce foundation models: the lack of large-scale multimodal benchmarks and the lack of effective multimodal integration. It introduces MMECInstruct, a multimodal instruction dataset with 75K samples across seven tasks, and CASLIE, a lightweight framework with three modules: EC^3 converts visual content into context-conditioned captions, CQE evaluates caption relevance, and uniM^3 fuses the textual cues for downstream tasks. CASLIE models fine-tuned on MMECInstruct improve substantially over diverse baselines, with strong in-domain performance and notable out-of-domain generalization, which demonstrates the utility of context-aware visual representations and selective information use. The approach is modular and plug-and-play, leverages world knowledge from LLMs, and is designed for scalable deployment in real-world e-commerce systems. The resources are open-sourced, and the paper includes extensive ablations of the CASLIE architecture.
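The three-stage flow named in the TL;DR (caption the image in context, check whether the caption is relevant, then fuse the text) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Sample`, `build_model_input`, the `captioner` and `is_useful` callables, and the prompt layout are hypothetical names and choices, not the CASLIE implementation or its API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Sample:
    """Hypothetical container for one multimodal instruction sample."""
    instruction: str
    text: str
    image_id: str


def build_model_input(
    sample: Sample,
    captioner: Callable[[str, str], str],
    is_useful: Callable[[str, str], bool],
) -> str:
    """Assemble a text-only input from a multimodal sample.

    captioner(image_id, context) stands in for a context-conditioned
    captioning step (the role the TL;DR assigns to EC^3).
    is_useful(caption, context) stands in for a caption relevance check
    (the role of CQE). The joined string is what a text model would then
    consume (the role of uniM^3). All three are illustrative stand-ins.
    """
    # Caption the image with the product text as context.
    caption = captioner(sample.image_id, sample.text)

    parts = [sample.instruction, sample.text]
    # Use the visual information selectively: keep the caption only if
    # the relevance check accepts it.
    if is_useful(caption, sample.text):
        parts.append(f"Image caption: {caption}")
    return "\n".join(parts)
```

In this sketch a rejected caption is simply omitted, so the downstream model sees only the instruction and product text; how the actual framework handles rejected captions is not stated in this summary.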

Abstract

Leveraging multimodal data to drive breakthroughs in e-commerce applications through Multimodal Foundation Models (MFMs) is gaining increasing attention from the research community. However, there are significant challenges that hinder the optimal use of multimodal e-commerce data by foundation models: (1) the scarcity of large-scale, high-quality multimodal benchmark datasets; and (2) the lack of effective multimodal information integration methods. To address these challenges, in this paper, we introduce MMECInstruct, the first-ever, large-scale, and high-quality multimodal instruction dataset for e-commerce. We also develop CASLIE, a simple, lightweight, yet effective framework for integrating multimodal information for e-commerce. Leveraging MMECInstruct, we fine-tune a series of e-commerce MFMs within CASLIE, denoted as CASLIE models. Our comprehensive evaluation demonstrates that CASLIE models substantially outperform 5 categories of advanced baseline models in the in-domain evaluation. Moreover, CASLIE models show strong generalizability to out-of-domain settings. MMECInstruct and CASLIE models are publicly accessible through https://ninglab.github.io/CASLIE/.
