PUMGPT: A Large Vision-Language Model for Product Understanding

Wei Xue, Zongyi Guo, Baoliang Cui, Zheng Xing, Xiaoyi Zeng, Xiufei Wang, Shuhui Wu, Weiming Lu

TL;DR

The paper tackles reliable, domain-specific product understanding in e-commerce by introducing PumGPT, a large vision-language model trained on a hallucination-filtered dataset of ~663k high-quality AliExpress products (filtered down from roughly 1M collected). It presents a universal multi-expert framework that detects and filters out inconsistent attributes, supports attribute inference and correction as part of five practical tasks (CG, CMC, AI, CC, AC), and introduces PumBench to evaluate LVLM performance in this domain. Empirical results show that PumGPT outperforms five open-source LVLMs and GPT-4V across tasks, with notable gains in attribute inference and rejection handling and strong domain-level performance, including on standardized versus non-standardized attributes. The work underscores the importance of domain specialization and data-quality controls for practical e-commerce workflows, and it identifies broader task diversity and higher data quality as directions for further improvement.

Abstract

E-commerce platforms benefit from accurate product understanding to enhance user experience and operational efficiency. Traditional methods often focus on isolated tasks such as attribute extraction or categorization, posing adaptability issues to evolving tasks and leading to usability challenges with noisy data from the internet. Current Large Vision Language Models (LVLMs) lack domain-specific fine-tuning, thus falling short in precision and instruction following. To address these issues, we introduce PumGPT, the first e-commerce specialized LVLM designed for multi-modal product understanding tasks. We collected and curated a dataset of over one million products from AliExpress, filtering out non-inferable attributes using a universal hallucination detection framework, resulting in 663k high-quality data samples. PumGPT focuses on five essential tasks aimed at enhancing workflows for e-commerce platforms and retailers. We also introduce PumBench, a benchmark to evaluate product understanding across LVLMs. Our experiments show that PumGPT outperforms five other open-source LVLMs and GPT-4V in product understanding tasks. We also conduct extensive analytical experiments to delve deeply into the superiority of PumGPT, demonstrating the necessity for a specialized model in the e-commerce domain.

Paper Structure

This paper contains 20 sections, 1 equation, 4 figures, 10 tables.

Figures (4)

  • Figure 1: A glimpse of PumGPT in product understanding.
  • Figure 2: The overview of our proposed hallucination detection framework.
  • Figure 3: Most common attribute names and proportion of 8 primary categories.
  • Figure 4: Ablation on hallucination filtering. Here we report the accuracy of the attribute inference task, where "w Hallu" means the model was trained on the hallucination dataset and "w/o Hallu" means it was trained on the hallucination-free dataset.