ProgGen: Generating Named Entity Recognition Datasets Step-by-step with Self-Reflexive Large Language Models

Yuzhao Heng, Chunyuan Deng, Yitong Li, Yue Yu, Yinghao Li, Rongzhi Zhang, Chao Zhang

TL;DR

ProgGen presents a self-reflective, stepwise data-generation framework to create high-quality NER datasets with limited supervision. By decomposing generation into entity-term creation, attribute-driven sentence diversification, and self-corrected annotation, the method achieves robust performance across general and niche domains at lower cost than traditional LLM-based data augmentation. Empirical results on CoNLL-2003, WikiGold, MIT-Movie, and MIT-Restaurant show that diversity-focused variants (notably Diversify Y) plus optional self-correction yield meaningful F1 gains while maintaining cost efficiency. The work highlights the importance of entity diversity and annotation accuracy as key drivers of performance, and discusses domain-specific challenges and biases in LLM-based annotation.

Abstract

Although Large Language Models (LLMs) exhibit remarkable adaptability across domains, these models often fall short in structured knowledge extraction tasks such as named entity recognition (NER). This paper explores an innovative, cost-efficient strategy to harness LLMs with modest NER capabilities for producing superior NER datasets. Our approach diverges from the basic class-conditional prompts by instructing LLMs to self-reflect on the specific domain, thereby generating domain-relevant attributes (such as category and emotions for movie reviews), which are utilized for creating attribute-rich training data. Furthermore, we preemptively generate entity terms and then develop NER context data around these entities, effectively bypassing the LLMs' challenges with complex structures. Our experiments across both general and niche domains reveal significant performance enhancements over conventional data generation methods while being more cost-effective than existing alternatives.
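The entity-first idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_sentence_prompt` and `bio_tags`, the prompt wording, and the entity type labels are assumptions made for illustration. The point it shows is that when entity terms are fixed before the sentence is written, token-level labels can be recovered by span matching, so the LLM never has to emit a complex annotated structure.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity term, entity type); types here are illustrative


def build_sentence_prompt(domain: str, entities: List[Entity]) -> str:
    """Build a hypothetical prompt asking an LLM to write a sentence around given entities."""
    listed = "; ".join(f"{term} ({etype})" for term, etype in entities)
    return (
        f"Write one sentence in the style of {domain} "
        f"that mentions these named entities: {listed}."
    )


def bio_tags(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Assign BIO tags by matching each pre-generated entity term as a token span."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    for term, etype in entities:
        term_tokens = term.lower().split()
        n = len(term_tokens)
        if n == 0:
            continue
        for i in range(len(tokens) - n + 1):
            span_is_free = all(t == "O" for t in tags[i:i + n])
            # Tag only exact, case-insensitive matches that do not overlap earlier tags.
            if lowered[i:i + n] == term_tokens and span_is_free:
                tags[i] = f"B-{etype}"
                for j in range(i + 1, i + n):
                    tags[j] = f"I-{etype}"
    return tags


# Stand-in for an LLM response; a real pipeline would send the prompt to a model.
entities = [("Inception", "Title"), ("Christopher Nolan", "Director")]
prompt = build_sentence_prompt("movie queries", entities)
tokens = "Show me Inception reviews by Christopher Nolan".split()
tags = bio_tags(tokens, entities)
```

Exact span matching is the simplest possible choice here; it would miss paraphrased or inflected entity mentions, which is one reason a separate annotation and correction step may still be useful.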

Paper Structure

This paper contains 145 sections, 1 equation, 5 figures, 36 tables.

Figures (5)

  • Figure 1: ProgGen NER data generation pipeline. Given a dataset domain, a set of entity classes of interest with definitions, and a few demo samples, we prompt an LLM step-by-step to generate diverse NER samples. We then use the generated samples to train a small NER model.
  • Figure 2: Diversity Requirement Generation workflow. (Left) Diversify Sentence: LLMs are prompted to generate attribute dimensions first and then attribute values for each dimension. (Right) Diversify Entities: LLMs are prompted to generate named entities for each entity class, optionally conditioned on a domain-specific "topic" category.
  • Figure 3: Sample Diversity Scaling plot. The corresponding F1 scores from the main results (Table \ref{tbl:main-results}) are shown as dashed lines.
  • Figure 4: Step-wise NER Sample Generation pipeline.
  • Figure 5: LLM Entity Type Self-Correction counts by dataset.
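
The diversity requirement workflow described for Figure 2 can be sketched as a sampling step that runs before each generation prompt. This is a hedged illustration, not the paper's code: the attribute dimensions, attribute values, entity pool, and the helper `sample_requirements` are all hypothetical stand-ins for what an LLM would produce in the two diversification steps.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical output of "Diversify Sentence": attribute dimensions and their values.
ATTRIBUTES: Dict[str, List[str]] = {
    "category": ["thriller", "comedy", "documentary"],
    "emotion": ["excited", "disappointed", "curious"],
}

# Hypothetical output of "Diversify Entities": generated terms per entity class.
ENTITY_POOL: Dict[str, List[str]] = {
    "Title": ["Inception", "Spirited Away"],
    "Director": ["Christopher Nolan", "Hayao Miyazaki"],
}


def sample_requirements(
    attributes: Dict[str, List[str]],
    entity_pool: Dict[str, List[str]],
    n_entities: int,
    rng: random.Random,
) -> Dict[str, object]:
    """Draw one attribute value per dimension and a few entities for a single prompt."""
    sentence_attrs = {dim: rng.choice(values) for dim, values in attributes.items()}
    candidates: List[Tuple[str, str]] = [
        (term, etype) for etype, terms in entity_pool.items() for term in terms
    ]
    # Sample without replacement so one prompt never repeats an entity.
    picked = rng.sample(candidates, k=min(n_entities, len(candidates)))
    return {"attributes": sentence_attrs, "entities": picked}


rng = random.Random(0)  # seeded so repeated runs draw the same requirements
requirements = sample_requirements(ATTRIBUTES, ENTITY_POOL, n_entities=2, rng=rng)
```

Each call yields a different combination of sentence attributes and entities, which is one simple way to push generated samples apart; uniform sampling is an assumption here, and other sampling schemes are equally plausible.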