Beyond Boundaries: Learning a Universal Entity Taxonomy across Datasets and Languages for Open Named Entity Recognition

Yuming Yang, Wantong Zhao, Caishuang Huang, Junjie Ye, Xiao Wang, Huiyuan Zheng, Yang Nan, Yuran Wang, Xueying Xu, Kaixin Huang, Yunke Zhang, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

This work tackles Open NER by addressing two core bottlenecks: inconsistent entity definitions across diverse datasets and data redundancy. It introduces B2NERD, a compact bilingual dataset refined from 54 English/Chinese datasets into a universal taxonomy of 400+ entity types, and a diversity-aware data-pruning strategy to maximize semantic coverage. Through instruction tuning with regularization, the authors train B2NER models that generalize across datasets and languages, outperforming GPT-4 and prior methods on multiple out-of-domain benchmarks while remaining competitive in in-domain tasks. The approach demonstrates strong cross-language transfer, scalable taxonomy expansion, and data-efficient learning, with public release of data, models, and code to support further research.

Abstract

Open Named Entity Recognition (NER), which involves identifying arbitrary types of entities from arbitrary domains, remains challenging for Large Language Models (LLMs). Recent studies suggest that fine-tuning LLMs on extensive NER data can boost their performance. However, training directly on existing datasets neglects their inconsistent entity definitions and redundant data, limiting LLMs to dataset-specific learning and hindering out-of-domain adaptation. To address this, we present B2NERD, a compact dataset designed to guide LLMs' generalization in Open NER under a universal entity taxonomy. B2NERD is refined from 54 existing English and Chinese datasets using a two-step process. First, we detect inconsistent entity definitions across datasets and clarify them by distinguishable label names to construct a universal taxonomy of 400+ entity types. Second, we address redundancy using a data pruning strategy that selects fewer samples with greater category and semantic diversity. Comprehensive evaluation shows that B2NERD significantly enhances LLMs' Open NER capabilities. Our B2NER models, trained on B2NERD, outperform GPT-4 by 6.8-12.0 F1 points and surpass previous methods in 3 out-of-domain benchmarks across 15 datasets and 6 languages. The data, models, and code are publicly available at https://github.com/UmeanNever/B2NER.
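
The second step described in the abstract, pruning toward greater category and semantic diversity, can be illustrated with a minimal sketch. This is not the authors' implementation: the greedy selection rule, the `per_type_cap` and `max_sim` parameters, and the idea that each sample arrives with a precomputed embedding are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical sample layout: (text, entity types present, sentence embedding).
Sample = Tuple[str, Set[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def prune(samples: List[Sample], per_type_cap: int = 2,
          max_sim: float = 0.9) -> List[Sample]:
    """Greedily keep samples that add category or semantic diversity.

    Assumed rule (not from the paper): a sample is dropped when every
    entity type it contains already has `per_type_cap` kept samples, or
    when it is a near-duplicate (cosine > `max_sim`) of a kept sample.
    """
    kept: List[Sample] = []
    type_counts: Dict[str, int] = {}
    for sample in samples:
        _, types, emb = sample
        # Category diversity: skip if all of its types are already saturated.
        if types and all(type_counts.get(t, 0) >= per_type_cap for t in types):
            continue
        # Semantic diversity: skip near-duplicates of anything already kept.
        if any(cosine(emb, other[2]) > max_sim for other in kept):
            continue
        kept.append(sample)
        for t in types:
            type_counts[t] = type_counts.get(t, 0) + 1
    return kept
```

In this sketch the cap limits how many samples each entity type contributes, while the similarity check drops paraphrase-like sentences, so a smaller set can still cover many types and contexts.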

Paper Structure (38 sections, 1 equation, 10 figures, 17 tables)

This paper contains 38 sections, 1 equation, 10 figures, 17 tables.

Figures (10)

  • Figure 1: The Open NER task aims to extract arbitrary entities (common and unseen) from arbitrary domains (in-domain and out-of-domain). Current LLMs, like GPT-4, still fall short on this task.
  • Figure 2: Sample results of BERT-based cross-dataset entity validation for the LOC entity type. Light colors indicate conflicting entity definitions. Training an LLM on these inconsistent datasets leads to confusion during inference.
  • Figure 3: Framework of B2NERD data construction: raw NER datasets are reshaped into a cohesive dataset via entity definition standardization and diversity-aware data pruning. Final data is then used to train our Open NER model.
  • Figure 5: Data scaling results for different sampling methods. The diversity-aware strategy (blue) achieves better performance with fewer samples.
  • Figure 6: More results from model-based cross-validation on PER and ORG among 4 datasets. The horizontal axis represents the testing data and the vertical axis represents the training data.
  • ...and 5 more figures
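
The cross-dataset validation in Figures 2 and 6 (train on one dataset, test on another, for a single entity type) can be sketched as a small scoring routine. This is an illustrative assumption rather than the paper's procedure: it presumes predictions are already available as sets of (sentence id, mention) pairs, uses entity-level F1 as the score, and the `threshold` for flagging a pair is hypothetical.

```python
from typing import Dict, List, Set, Tuple

Mention = Tuple[int, str]  # (sentence id, mention text); hypothetical layout


def f1(pred: Set[Mention], gold: Set[Mention]) -> float:
    """Entity-level F1 between predicted and gold mention sets."""
    true_pos = len(pred & gold)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold)
    return 2 * precision * recall / (precision + recall)


def conflict_pairs(
    preds: Dict[Tuple[str, str], Set[Mention]],
    gold: Dict[str, Set[Mention]],
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (train, test, F1) triples whose cross-dataset F1 is low.

    `preds[(train, test)]` holds the mentions of one entity type that a
    model trained on `train` predicts on the test split of `test`.
    A low score hints that the two datasets define the type differently.
    """
    flagged = []
    for (train, test), pred in sorted(preds.items()):
        if train == test:
            continue  # in-dataset scores say nothing about definition conflicts
        score = f1(pred, gold[test])
        if score < threshold:
            flagged.append((train, test, score))
    return flagged
```

Pairs flagged this way would be candidates for manual inspection and relabeling with more distinguishable type names; the score alone does not prove that the definitions conflict.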