MIND: Multimodal Shopping Intention Distillation from Large Vision-language Models for E-commerce Purchase Understanding

Baixuan Xu, Weiqi Wang, Haochen Shi, Wenxuan Ding, Huihao Jing, Tianqing Fang, Jiaxin Bai, Xin Liu, Changlong Yu, Zheng Li, Chen Luo, Qingyu Yin, Bing Yin, Long Chen, Yangqiu Song

TL;DR

Mind delivers a scalable solution to acquire purchase intentions by distilling multimodal signals from co-buy records and product imagery via LVLMs. It constructs a large multimodal intention knowledge base (over 1.26 million intentions across 107k products) and employs a role-aware filtering step to ensure human-centric grounding, reducing product-centric bias. Intrinsic evaluation confirms high plausibility and typicality, while downstream experiments on the IntentionQA benchmark show that fine-tuning with Mind-derived intents improves intention understanding and utilization, outperforming text-only baselines and showcasing robustness and diversity. The approach advances E-commerce purchase understanding by integrating visual cues, commonsense grounding, and automatic quality control, with practical implications for recommendation and search systems.

Abstract

Improving user experience and providing personalized search results in E-commerce platforms heavily rely on understanding purchase intention. However, existing methods for acquiring large-scale intentions bank on distilling large language models with human annotation for verification. Such an approach tends to generate product-centric intentions, overlook valuable visual information from product images, and incur high costs for scalability. To address these issues, we introduce MIND, a multimodal framework that allows Large Vision-Language Models (LVLMs) to infer purchase intentions from multimodal product metadata and prioritize human-centric ones. Using Amazon Review data, we apply MIND and create a multimodal intention knowledge base, which contains 1,264,441 intentions derived from 126,142 co-buy shopping records across 107,215 products. Extensive human evaluations demonstrate the high plausibility and typicality of our obtained intentions and validate the effectiveness of our distillation framework and filtering mechanism. Additional experiments reveal that our obtained intentions significantly enhance large language models in two intention comprehension tasks.

Paper Structure (35 sections, 7 figures, 5 tables)


Figures (7)

  • Figure 1: Examples showing the process of distilling purchase intentions from large language models and large vision-language models. Without product images, large language models tend to generate intentions with low typicality and hallucinated facts, while leveraging large vision-language models resolves this issue.
  • Figure 2: An overview of Mind. We first extract features from products in real-world co-buy records, generate intentions multimodally, and apply a human-centric role-aware filter for quality optimization.
  • Figure 3: Distribution of hypernyms sourced from Probase in Mind with top frequencies.
  • Figure 4: Ablation results on IntentionQA tasks by Mistral-7B distilled on intentions generated by Mind with/without filtering.
  • Figure 5: Relation-wise comparison of typicality scores across all relations between Mind and FolkScope.
  • ...and 2 more figures