A Framework for Leveraging Partially-Labeled Data for Product Attribute-Value Identification

D. Subhalingam, Keshav Kolluru, Mausam, Saurabh Singal

TL;DR

GenToC is a model designed to train directly on partially-labeled data, eliminating the need for a fully annotated dataset. It learns from a limited set of partially-labeled data and can be used to improve the training of more efficient models, advancing the automated extraction of attribute-value pairs.

Abstract

In the e-commerce domain, the accurate extraction of attribute-value pairs (e.g., Brand: Apple) from product titles and user search queries is crucial for enhancing search and recommendation systems. A major challenge with neural models for this task is the lack of high-quality training data, as the annotations for attribute-value pairs in the available datasets are often incomplete. To address this, we introduce GenToC, a model designed for training directly with partially-labeled data, eliminating the necessity for a fully annotated dataset. GenToC employs a marker-augmented generative model to identify potential attributes, followed by a token classification model that determines the associated values for each attribute. GenToC outperforms existing state-of-the-art models, exhibiting up to a 56.3% increase in the number of accurate extractions. Furthermore, we utilize GenToC to regenerate the training dataset to expand attribute-value annotations. This bootstrapping substantially improves the data quality for training other standard NER models, which are typically faster but less capable of handling partially-labeled data, enabling them to achieve performance comparable to GenToC. Our results demonstrate GenToC's unique ability to learn from a limited set of partially-labeled data and improve the training of more efficient models, advancing the automated extraction of attribute-value pairs. Finally, our model has been successfully integrated into IndiaMART, India's largest B2B e-commerce platform, achieving a significant increase of 20.2% in the number of correctly identified attribute-value pairs over the existing deployed system while achieving a high precision of 89.5%.
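The bootstrapping step described in the abstract can be illustrated with a minimal sketch. It assumes that an extractor returns attribute-value pairs for a query and that those pairs are projected onto word-level tags to form training data for an NER-style model. The names `pairs_to_tags`, `bootstrap`, and `toy_extractor` are hypothetical, and the toy extractor is only a stand-in for a trained GenToC model, not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple


def pairs_to_tags(words: List[str], pairs: Dict[str, str]) -> List[str]:
    """Tag each word with the attribute whose value covers it, else 'O'."""
    tags = ["O"] * len(words)
    for attribute, value in pairs.items():
        value_words = value.split()
        n = len(value_words)
        if n == 0:
            continue
        # Tag the first exact occurrence of the value's words in the query.
        for start in range(len(words) - n + 1):
            if words[start:start + n] == value_words:
                for i in range(start, start + n):
                    tags[i] = attribute
                break
    return tags


def bootstrap(
    queries: List[str],
    extract: Callable[[str], Dict[str, str]],
) -> List[Tuple[List[str], List[str]]]:
    """Re-annotate every query with the extractor's predictions."""
    dataset = []
    for query in queries:
        words = query.split()
        dataset.append((words, pairs_to_tags(words, extract(query))))
    return dataset


def toy_extractor(query: str) -> Dict[str, str]:
    """Hypothetical stand-in for a trained extractor (assumption)."""
    lookup = {"apple": "Brand", "64gb": "Storage", "black": "Color"}
    return {lookup[w]: w for w in query.split() if w in lookup}


regenerated = bootstrap(["apple iphone 64gb black"], toy_extractor)
```

In this sketch, the regenerated dataset would then be used to train the faster word-tagging model in place of the original, incompletely labeled data.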

Paper Structure (22 sections, 5 figures, 8 tables, 2 algorithms)

Figures (5)

  • Figure 1: Overall framework. We train the GenToC system with markers so that it learns effectively from incomplete training data. It is then used to bootstrap high-quality training data for the real-time NER attribute-value extraction (AVE) system.
  • Figure 2: Model architectures. (a) Seq2Seq-AVE outputs a string that concatenates all attribute-value pairs for a given input query. (b) NER-AVE classifies each word in the query, tagging it with the relevant attribute. (c) GenToC employs Gen-AE to yield a concatenated list of attributes and ToC-VE to annotate the values linked to every recognized attribute. The Gen-AE model incorporates markers ('M') during the training process for the words which are covered. During inference, these markers are applied to all the words in the query.
  • Figure 3: Distribution of product categories within the training dataset. Specific percentage values are omitted to preserve data confidentiality.
  • Figure 4: Precision-Recall curves show that GenToC and NER-AVE (GenToC bootstrapping) significantly outperform the remaining models, NER-AVE and Seq2Seq-AVE.
  • Figure 5: Performance of NER-AVE, Seq2Seq-AVE, GenToC, and GenToC-bootstrapped NER-AVE on long-tail attribute names. NER-AVE trained on original data shows poor performance on infrequent attribute names.
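
The marker scheme from the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the marker is written as a hypothetical `[M]` token placed before a word, and a word counts as "covered" when it carries an attribute label in the partially-labeled training data. The actual marker symbol and its placement in GenToC may differ.

```python
from typing import List, Optional, Set

MARKER = "[M]"  # hypothetical marker token (assumption, not from the paper)


def mark_query(words: List[str], covered: Set[int]) -> str:
    """Place the marker before every word whose index is in `covered`."""
    return " ".join(
        f"{MARKER} {word}" if i in covered else word
        for i, word in enumerate(words)
    )


def training_input(words: List[str], labels: List[Optional[str]]) -> str:
    """Training: mark only the words that have an annotation."""
    covered = {i for i, label in enumerate(labels) if label is not None}
    return mark_query(words, covered)


def inference_input(words: List[str]) -> str:
    """Inference: mark every word in the query."""
    return mark_query(words, set(range(len(words))))


words = ["apple", "iphone", "64gb", "black"]
labels = ["Brand", None, "Storage", None]  # partial labels: two words unlabeled

train_text = training_input(words, labels)
infer_text = inference_input(words)
```

Here `train_text` marks only "apple" and "64gb", while `infer_text` marks all four words, mirroring the caption's statement that markers are applied to covered words during training and to all words during inference.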