PUMGPT: A Large Vision-Language Model for Product Understanding

Wei Xue; Zongyi Guo; Baoliang Cui; Zheng Xing; Xiaoyi Zeng; Xiufei Wang; Shuhui Wu; Weiming Lu

TL;DR

The paper addresses reliable, domain-specific product understanding in e-commerce by introducing PumGPT, a large vision-language model trained on a hallucination-filtered dataset of about 663k high-quality AliExpress products (drawn from an original collection of about 1M). It presents a universal multi-expert framework that detects and filters inconsistent attributes, which supports robust attribute inference and correction across five practical tasks (CG, CMC, AI, CC, AC), and it introduces PumBench to evaluate LVLM performance in this domain. Empirical results show that PumGPT outperforms five open-source LVLMs and GPT-4V across these tasks, with notable gains in attribute inference and rejection handling, and strong domain-level performance, especially on standardized versus non-standardized attributes. The work underscores the importance of domain specialization and data-quality controls for practical e-commerce workflows, and it identifies greater task diversity and higher data quality as directions for future improvement.

Abstract

E-commerce platforms benefit from accurate product understanding to enhance user experience and operational efficiency. Traditional methods often focus on isolated tasks such as attribute extraction or categorization, so they adapt poorly to evolving tasks and are hard to use with noisy data from the internet. Current Large Vision Language Models (LVLMs) lack domain-specific fine-tuning and therefore fall short in precision and instruction following. To address these issues, we introduce PumGPT, the first e-commerce-specialized LVLM designed for multi-modal product understanding tasks. We collected and curated a dataset of over one million products from AliExpress and filtered out non-inferable attributes using a universal hallucination detection framework, resulting in 663k high-quality data samples. PumGPT focuses on five essential tasks aimed at enhancing workflows for e-commerce platforms and retailers. We also introduce PumBench, a benchmark to evaluate product understanding across LVLMs. Our experiments show that PumGPT outperforms five other open-source LVLMs and GPT-4V in product understanding tasks. We also conduct extensive analytical experiments to examine where PumGPT's advantage comes from, demonstrating the necessity for a specialized model in the e-commerce domain.
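To make the idea of filtering non-inferable attributes concrete, here is a minimal sketch in Python. It is not the authors' framework: the abstract does not say what the experts are or how their judgments are combined, so the `Product` record, the two toy string-matching experts, and the `min_votes` threshold are all hypothetical stand-ins. The sketch only illustrates the general pattern of keeping an attribute when enough independent checks agree that it is supported by the product's title or image evidence.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Product:
    title: str
    caption: str  # hypothetical stand-in for evidence extracted from the image
    attributes: Dict[str, str]


# Hypothetical expert signature: (product, attribute name, attribute value) -> supported?
Expert = Callable[[Product, str, str], bool]


def title_expert(product: Product, name: str, value: str) -> bool:
    """Toy expert: the value is supported if it appears in the title."""
    return value.lower() in product.title.lower()


def caption_expert(product: Product, name: str, value: str) -> bool:
    """Toy expert: the value is supported if it appears in the image caption."""
    return value.lower() in product.caption.lower()


def filter_attributes(
    product: Product, experts: List[Expert], min_votes: int
) -> Dict[str, str]:
    """Keep only attributes that at least `min_votes` experts judge as supported."""
    kept: Dict[str, str] = {}
    for name, value in product.attributes.items():
        votes = sum(1 for expert in experts if expert(product, name, value))
        if votes >= min_votes:
            kept[name] = value
    return kept


if __name__ == "__main__":
    item = Product(
        title="Red Cotton T-Shirt",
        caption="a red shirt on a hanger",
        attributes={"Color": "Red", "Material": "Cotton", "Origin": "CN"},
    )
    # "Origin" is supported by neither the title nor the caption, so it is dropped.
    print(filter_attributes(item, [title_expert, caption_expert], min_votes=1))
```

Raising `min_votes` makes the filter stricter: with `min_votes=2` only "Color" survives in this example, since "Cotton" appears in the title but not in the caption. In a real pipeline the toy experts would be replaced by stronger judges, and the right threshold would depend on how much data loss is acceptable.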

Paper Structure (20 sections, 1 equation, 4 figures, 10 tables)

Introduction
Related Works
PumGPT
Data Collection
Hallucination Filtering
Product Understanding Tasks Formulation
Benchmarking on Product Understanding Tasks
Implementation details and baselines
Datasets and metrics
Experimental Results
Main Results on PumBench
Domain-level Results on Attribute Inference
Ablation on Hallucination Filtering
Evaluation on Rejection Ability
Case Study
...and 5 more sections

Figures (4)

Figure 1: A glimpse of PumGPT in product understanding.
Figure 2: The overview of our proposed hallucination detection framework.
Figure 3: Most common attribute names and proportion of 8 primary categories.
Figure 4: Ablation on hallucination filtering. We report accuracy on the attribute inference task, where "w Hallu" means the model was trained on the dataset containing hallucinations and "w/o Hallu" means it was trained on the hallucination-free dataset.
