Visual Zero-Shot E-Commerce Product Attribute Value Extraction

Jiaying Gong, Ming Cheng, Hongda Shen, Pierre-Yves Vandenbussche, Janet Jenq, Hoda Eldardiry

TL;DR

ViOC-AG tackles zero-shot product attribute value extraction using only product images. It leverages a CLIP-based framework with a task-specific text decoder trained in a text-only manner, and employs OCR tokens plus a frozen prompt-based LLM to correct out-of-domain outputs during inference. Evaluated on the MAVE dataset, ViOC-AG outperforms fine-tuned vision-language models and approaches text-description baselines, demonstrating the practicality of image-only attribute generation for e-commerce. The approach reduces seller burden while offering scalable attribute-value generation with potential for further improvements in OCR and category-aware decoding.

Abstract

Existing zero-shot product attribute value (aspect) extraction approaches in the e-commerce industry rely on uni-modal or multi-modal models that require sellers to provide detailed textual inputs (product descriptions). Typing these descriptions by hand is time-consuming and frustrating for sellers. We therefore propose ViOC-AG, a cross-modal zero-shot attribute value generation framework based on CLIP that requires only product images as input. ViOC-AG follows a text-only training process in which a task-customized text decoder is trained with the frozen CLIP text encoder to alleviate the modality gap and task disconnection. During zero-shot inference, product aspects are generated by the frozen CLIP image encoder connected to the trained task-customized text decoder. OCR tokens and outputs from a frozen prompt-based LLM correct the decoded outputs for out-of-domain attribute values. Experiments show that ViOC-AG significantly outperforms other fine-tuned vision-language models for zero-shot attribute value extraction.
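
The encoder swap described in the abstract (train on text embeddings, infer from image embeddings in the same shared space) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the authors' implementation: the class `SharedSpaceGenerator`, the two toy encoders standing in for the frozen CLIP encoders, and the nearest-neighbour lookup that stands in for the trained projector and text decoder.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v] if norm > 0 else list(v)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def toy_text_encoder(text: str) -> Vector:
    # Hypothetical stand-in for a frozen text encoder: counts three keywords.
    words = text.lower().split()
    return [float(words.count(k)) for k in ("red", "blue", "cotton")]


def toy_image_encoder(image: Dict[str, float]) -> Vector:
    # Hypothetical stand-in for a frozen image encoder: the "image" is a dict
    # of visual feature scores laid out on the same three axes as the text side.
    return [image.get(k, 0.0) for k in ("red", "blue", "cotton")]


class SharedSpaceGenerator:
    """Toy model: fitted on text embeddings only, queried with image embeddings."""

    def __init__(self) -> None:
        # Maps each target attribute-value string to a text embedding.
        self.memory: Dict[str, Vector] = {}

    def train_text_only(self, pairs: Sequence[Tuple[str, str]]) -> None:
        # Only the text encoder is used during training; no images are seen.
        for description, target in pairs:
            self.memory[target] = l2_normalize(toy_text_encoder(description))

    def generate_from_image(self, image: Dict[str, float]) -> str:
        # At inference the image encoder replaces the text encoder.
        query = l2_normalize(toy_image_encoder(image))
        return max(self.memory, key=lambda t: cosine(self.memory[t], query))


model = SharedSpaceGenerator()
model.train_text_only([
    ("red cotton shirt", "Color: Red"),
    ("blue cotton shirt", "Color: Blue"),
])
result = model.generate_from_image({"red": 0.9, "cotton": 0.4})
print(result)
```

The sketch only works because both toy encoders write into the same coordinate system. With real encoders the text and image embeddings do not align this neatly, which is the modality gap the abstract says the task-customized decoder is trained to alleviate.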

Paper Structure

This paper contains 21 sections, 4 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example of cross-modal aspect generation.
  • Figure 2: The overview of our proposed ViOC-AG model. Only the projector and the text decoder are trainable.
  • Figure 3: Demonstrations of ViOC-AG for product attribute value generation across eight different categories.
  • Figure 4: Label Count Distribution.