Self-Refinement Strategies for LLM-based Product Attribute Value Extraction
Alexander Brinkmann, Christian Bizer
TL;DR
The study evaluates two automated self-refinement techniques, error-based prompt rewriting and self-correction, for LLM-based product attribute value extraction on OA-Mine and AE-110K. Across zero-shot, few-shot, and fine-tuning scenarios using GPT-4o, neither technique significantly improves extraction performance, and both incur higher token costs. Fine-tuning the model with development data produces the highest F1 and is the most cost-effective option when many product descriptions must be processed, while self-refinement remains generally unnecessary for this task. The findings favor fine-tuning for large-scale attribute extraction in e-commerce unless future work reduces the cost or improves the performance of self-refinement methods.
Abstract
Structured product data, in the form of attribute-value pairs, is essential for e-commerce platforms to support features such as faceted product search and attribute-based product comparison. However, vendors often provide unstructured product descriptions, making attribute value extraction necessary to ensure data consistency and usability. Large language models (LLMs) have demonstrated their potential for product attribute value extraction in few-shot scenarios. Recent research has shown that self-refinement techniques can improve the performance of LLMs on tasks such as code generation and text-to-SQL translation. For other tasks, applying these techniques has increased costs, because additional tokens must be processed, without improving performance. This paper investigates applying two self-refinement techniques (error-based prompt rewriting and self-correction) to the product attribute value extraction task. The self-refinement techniques are evaluated across zero-shot, few-shot in-context learning, and fine-tuning scenarios using GPT-4o. The experiments show that both self-refinement techniques fail to significantly improve the extraction performance while substantially increasing processing costs. For scenarios with development data, fine-tuning yields the highest performance, and the ramp-up costs of fine-tuning are offset as the number of product descriptions increases.
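The cost argument in the last sentence can be illustrated with a simple break-even calculation. The sketch below is not taken from the paper: the function name, its parameters, and the numbers in the usage note are hypothetical, and it assumes that the fine-tuned model has a lower per-description cost than the alternative it is compared against (for example, because its prompts are shorter).

```python
import math


def break_even_descriptions(ramp_up_cost, cost_per_item_finetuned, cost_per_item_baseline):
    """Return the smallest number of product descriptions at which the total
    cost of fine-tuning (one-off ramp-up plus per-item cost) is no higher
    than the total cost of the baseline (per-item cost only).

    All three arguments are hypothetical inputs in the same cost unit.
    Returns None if the fine-tuned model is not cheaper per item, because
    the ramp-up cost is then never recovered.
    """
    saving_per_item = cost_per_item_baseline - cost_per_item_finetuned
    if saving_per_item <= 0:
        return None
    # ramp_up + n * c_ft <= n * c_base  <=>  n >= ramp_up / (c_base - c_ft)
    return math.ceil(ramp_up_cost / saving_per_item)
```

With illustrative values only, `break_even_descriptions(1000, 2, 7)` returns 200: below 200 descriptions the baseline is cheaper in total, and from 200 onward fine-tuning is.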
