NERsocial: Efficient Named Entity Recognition Dataset Construction for Human-Robot Interaction Utilizing RapidNER
Jesse Atuhurra, Hidetaka Kamigaito, Hiroki Ouchi, Hiroyuki Shindo, Taro Watanabe
TL;DR
NERsocial introduces RapidNER, a framework for rapid NER dataset construction in human-robot interaction that integrates Wikidata-derived hyponym dictionaries, Wikipedia content, and Elasticsearch-based annotation. The resulting NERsocial dataset spans six entity types (Drink, Food, Hobby, Job, Pet, Sport) and includes 99.4K sentences drawn from Wikipedia, Reddit, and Stack Exchange, with high inter-annotator agreement and token coverage. Transformer models trained on NERsocial reach F1 scores of around 96% across entity types, and ablations show that combining multiple data sources improves robustness to domain shift. The approach offers a scalable, domain-adaptive path to NER data creation, with implications for rapid deployment in HRI and other specialized domains.
Abstract
Adapting named entity recognition (NER) methods to new domains poses significant challenges. We introduce RapidNER, a framework designed for the rapid deployment of NER systems through efficient dataset construction. RapidNER operates in three key steps: (1) extracting domain-specific sub-graphs and triples from a general knowledge graph, (2) collecting texts from various sources to build the NERsocial dataset, which focuses on entities typical of human-robot interaction, and (3) implementing an annotation scheme based on Elasticsearch (ES) to improve annotation efficiency. NERsocial, validated by human annotators, includes six entity types, 153K tokens, and 99.4K sentences, demonstrating RapidNER's capability to expedite dataset creation.
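To make the dictionary-driven annotation idea in step (3) concrete, the following is a minimal sketch, not the authors' implementation. It assumes a small hand-written hyponym dictionary in place of the Wikidata-derived one, uses an in-memory longest-match lookup as a stand-in for Elasticsearch queries, and assumes a BIO tagging scheme. The names `HYPONYMS`, `build_index`, and `bio_tag` are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical hyponym dictionary (entity type -> surface forms).
# In RapidNER these would come from knowledge-graph sub-graphs; the
# entries below are illustrative only.
HYPONYMS: Dict[str, List[str]] = {
    "Drink": ["green tea", "coffee"],
    "Pet": ["dog"],
    "Sport": ["table tennis"],
}


def build_index(hyponyms: Dict[str, List[str]]) -> Dict[Tuple[str, ...], str]:
    """Map each lowercased token sequence to its entity type."""
    index: Dict[Tuple[str, ...], str] = {}
    for label, phrases in hyponyms.items():
        for phrase in phrases:
            index[tuple(phrase.lower().split())] = label
    return index


def bio_tag(tokens: List[str], index: Dict[Tuple[str, ...], str]) -> List[Tuple[str, str]]:
    """Tag tokens with BIO labels using greedy longest-match lookup."""
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    max_len = max((len(key) for key in index), default=0)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = index.get(tuple(lowered[i:i + n]))
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return list(zip(tokens, tags))


if __name__ == "__main__":
    index = build_index(HYPONYMS)
    for token, tag in bio_tag("My dog likes green tea".split(), index):
        print(token, tag)
```

Labels produced this way are only candidates: a dictionary match cannot resolve ambiguous surface forms, which is why the dataset described above is validated by human annotators.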
