"I've Heard of You!": Generate Spoken Named Entity Recognition Data for Unseen Entities

Jiawei Yu; Xiang Geng; Yuang Li; Mengxin Ren; Wei Tang; Jiahuan Li; Zhibin Lan; Min Zhang; Hao Yang; Shujian Huang; Jinsong Su

TL;DR

This work tackles the challenge of recognizing unseen entities in Spoken NER by leveraging a Named Entity Dictionary (NED) to generate synthetic data via a large language model (LLM) and text-to-speech (TTS), coupled with a noise-filtering mechanism. The HeardU framework demonstrates strong performance gains across in-domain, zero-shot domain adaptation, and fully zero-shot settings, and introduces the ST-CMDS-NER Chinese benchmark with 8,853 NED entries. The approach emphasizes efficient NED construction over full data annotation and shows robust generalization through LLM and TTS components, aided by a quantified noise metric. By releasing the benchmark and NED, the authors enable broader evaluation and adaptation for unseen-entity Spoken NER tasks.

Abstract

Spoken named entity recognition (NER) aims to identify named entities from speech, playing an important role in speech processing. New named entities appear every day; however, annotating Spoken NER data for them is costly. In this paper, we demonstrate that existing Spoken NER systems perform poorly when dealing with previously unseen named entities. To tackle this challenge, we propose a method for generating Spoken NER data based on a named entity dictionary (NED) to reduce costs. Specifically, we first use a large language model (LLM) to generate sentences from the sampled named entities and then use a text-to-speech (TTS) system to generate the speech. Furthermore, we introduce a noise metric to filter out noisy data. To evaluate our approach, we release a novel Spoken NER benchmark along with a corresponding NED containing 8,853 entities. Experimental results show that our method achieves state-of-the-art (SOTA) performance in the in-domain, zero-shot domain adaptation, and fully zero-shot settings. Our data will be available at https://github.com/DeepLearnXMU/HeardU.
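
The pipeline described in the abstract (sample entities from the NED, generate a sentence, synthesize speech, filter noisy samples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `generate_sentence` is a template stand-in for an LLM call, `synthesize` is a stand-in for a TTS system, and `noise_score` is a hypothetical proxy (the fraction of sampled entities missing from the generated text), not the noise metric defined in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class Sample:
    text: str        # generated sentence
    entities: list   # (name, type) pairs the sentence should contain
    audio: bytes     # synthesized speech (placeholder bytes here)


def generate_sentence(entities):
    # Hypothetical stand-in for an LLM call: a fixed template around the names.
    names = " and ".join(name for name, _ in entities)
    return f"I heard about {names} today."


def synthesize(text):
    # Hypothetical stand-in for a TTS system: returns bytes instead of audio.
    return text.encode("utf-8")


def noise_score(text, entities):
    # Assumed proxy metric, NOT the paper's: share of entities absent from text.
    missing = sum(1 for name, _ in entities if name not in text)
    return missing / len(entities)


def build_dataset(ned, n, k=2, max_noise=0.0, seed=0, generate=generate_sentence):
    """Draw k entities per attempt, generate text and audio, drop noisy samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        entities = rng.sample(ned, k)          # sample entities from the NED
        text = generate(entities)              # "LLM" step
        if noise_score(text, entities) > max_noise:
            continue                           # filter out noisy generations
        samples.append(Sample(text, entities, synthesize(text)))  # "TTS" step
    return samples


# Toy dictionary with made-up entries, for illustration only.
toy_ned = [("Alice", "PER"), ("Paris", "LOC"), ("Acme", "ORG")]
dataset = build_dataset(toy_ned, n=5)
```

In a real setup the two stand-ins would be replaced by calls to an actual LLM and TTS system, and the filter threshold would be tuned against whichever noise metric is used.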
