NERsocial: Efficient Named Entity Recognition Dataset Construction for Human-Robot Interaction Utilizing RapidNER

Jesse Atuhurra, Hidetaka Kamigaito, Hiroki Ouchi, Hiroyuki Shindo, Taro Watanabe

TL;DR

This paper introduces RapidNER, a framework for rapidly constructing NER datasets for human–robot interaction (HRI) that combines Wikidata-derived hyponym dictionaries, Wikipedia content, and Elasticsearch-based annotation. The resulting NERsocial dataset covers six entity types (Drink, Food, Hobby, Job, Pet, Sport) and contains 99.4K sentences drawn from Wikipedia, Reddit, and Stack Exchange, with high annotator agreement and token coverage. Transformer models trained on NERsocial reach F1 scores of around 96% across entity types, and ablations show that combining multiple data sources improves robustness to domain shift. The approach offers a scalable path to NER dataset creation for new domains, with implications for rapid deployment in HRI and other specialized domains.

Abstract

Adapting named entity recognition (NER) methods to new domains poses significant challenges. We introduce RapidNER, a framework designed for the rapid deployment of NER systems through efficient dataset construction. RapidNER operates through three key steps: (1) extracting domain-specific sub-graphs and triples from a general knowledge graph, (2) collecting and leveraging texts from various sources to build the NERsocial dataset, which focuses on entities typical in human-robot interaction, and (3) implementing an annotation scheme using Elasticsearch (ES) to enhance efficiency. NERsocial, validated by human annotators, includes six entity types, 153K tokens, and 99.4K sentences, demonstrating RapidNER's capability to expedite dataset creation.

Paper Structure

This paper contains 37 sections, 16 figures, 24 tables, 1 algorithm.

Figures (16)

  • Figure 1: Our NER dataset aims to support dialogue between humans and robots. New entity types are Drink, Food, Hobby, Job, Pet, Sport. We utilize Wikidata to acquire information about entity types.
  • Figure 2: We collected the texts from three sources: Wikipedia, online forums (Stack Exchange) and social media (Reddit). The red box indicates sections containing the texts that we are interested in, for each textual source.
  • Figure 3: The construction process of NERsocial. We gathered millions of triples from Wikidata and used the triples to collect Wikipedia articles. For each Wikipedia article, we extracted paragraphs from the introduction sections, and split them into sentences. Additionally, we collected conversational texts from Reddit and Stack Exchange. For each sentence, we annotated spans of text containing entity mentions with the help of ES. Human annotators verified text-span annotations before the text-spans were converted into NE labels.
  • Figure 4: The UMAP visualization shows the diversity of texts from three data sources: Reddit, Stack Exchange, and Wikipedia (best seen in color).
  • Figure 5: Text-span annotations inside ES (underlined text in figure). Some text-spans were incorrectly annotated (marked inside [ ] symbols above). For example, Food mentions were completely missed (i.e., puttu, lapis legit, kastengel, risoles), or a part of the entity mention was left outside the text span (i.e., in kue ku, kue putu). When manually checking the correctness of text-spans annotated with ES, we corrected these spans of text.
  • ...and 11 more figures
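
The annotation flow described in the captions for Figures 3 and 5 (dictionary-driven text-span annotation, followed by conversion of spans into NE labels) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses Elasticsearch for span matching, whereas the sketch substitutes a plain in-memory greedy longest-match lookup. The dictionary entries, the function names `annotate_spans` and `spans_to_bio`, and the BIO tagging scheme are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical mini-dictionary standing in for the Wikidata-derived
# hyponym dictionaries; the entries are illustrative only.
ENTITY_DICT: Dict[str, str] = {
    "green tea": "Drink",
    "tea": "Drink",
    "kue putu": "Food",
    "rock climbing": "Sport",
}

Span = Tuple[int, int, str]  # (start, end_exclusive, entity_type)


def annotate_spans(tokens: List[str], dictionary: Dict[str, str]) -> List[Span]:
    """Greedy longest-match dictionary lookup over a token list.

    An in-memory stand-in for the ES-based matching step, not the ES API.
    """
    max_len = max(len(key.split()) for key in dictionary)
    spans: List[Span] = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first so that "green tea"
        # is preferred over the shorter entry "tea".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                match = (i, i + n, dictionary[phrase])
                break
        if match is not None:
            spans.append(match)
            i = match[1]  # continue after the matched span
        else:
            i += 1
    return spans


def spans_to_bio(tokens: List[str], spans: List[Span]) -> List[str]:
    """Convert token-level spans into BIO labels (assumed tagging scheme)."""
    labels = ["O"] * len(tokens)
    for start, end, etype in spans:
        labels[start] = f"B-{etype}"
        for j in range(start + 1, end):
            labels[j] = f"I-{etype}"
    return labels


tokens = "I drink green tea after rock climbing".split()
spans = annotate_spans(tokens, ENTITY_DICT)
labels = spans_to_bio(tokens, spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

In the pipeline the captions describe, human annotators verify and correct the automatically produced spans before conversion. In this sketch that would correspond to editing the `spans` list (for example, adding a missed Food mention or extending a truncated span) before calling `spans_to_bio`.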