Named Entity Recognition for Address Extraction in Speech-to-Text Transcriptions Using Synthetic Data

Bibiána Lajčinová, Patrik Valábek, Michal Spišiak

TL;DR

The paper addresses the extraction of address components from Slovak speech-to-text (STT) transcriptions using a SlovakBERT-based NER model with a 9-label BIO scheme. Because real data are limited, it relies on large-scale synthetic data generated with GPT-3.5-turbo and iteratively refined through error-driven augmentation. The approach achieves over 90% accuracy on a real test set, demonstrating the viability of synthetic data for domain-specific NER in low-resource languages. This supports robust address parsing in STT pipelines and offers a practical path when real labeled data are scarce, with potential for retraining as more real data become available.

Abstract

This paper introduces an approach for building a Named Entity Recognition (NER) model on top of the Bidirectional Encoder Representations from Transformers (BERT) architecture, specifically the SlovakBERT model. The NER model extracts address parts from speech-to-text transcriptions. Because real data are scarce, a synthetic dataset was generated using the GPT API. The importance of mimicking spoken-language variability in this artificial data is emphasized. The performance of our NER model, trained solely on synthetic data, is evaluated on a small real test dataset.
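
To make the BIO tagging idea concrete, the following minimal Python sketch shows how token-level BIO tags can be grouped into address parts. It is an illustration only, not the authors' implementation: the label names (`STREET`, `HOUSE_NUMBER`, `MUNICIPALITY`, `POSTAL_CODE`), the helper `decode_bio`, and the example sentence are assumptions introduced here, chosen so that four entity types with `B-`/`I-` prefixes plus `O` give nine labels.

```python
from typing import List, Tuple

# Hypothetical label set (an assumption, not taken from the paper):
# 4 entity types x (B-, I-) prefixes + "O" = 9 labels.
ENTITY_TYPES = ["STREET", "HOUSE_NUMBER", "MUNICIPALITY", "POSTAL_CODE"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def flush() -> None:
        # Emit the span collected so far, if any.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "B" or (prefix == "I" and etype != current_type):
            # "B-" opens a new span; a stray "I-" of another type is treated as "B-".
            flush()
            current_type, current_tokens = etype, [token]
        elif prefix == "I":
            # Continuation of the span that is currently open.
            current_tokens.append(token)
        else:
            # "O": outside any entity, so close the open span.
            flush()
            current_type, current_tokens = None, []
    flush()
    return spans


# Made-up, transcription-style example (lowercase, numbers spelled out).
tokens = ["bývam", "na", "hlavnej", "ulici", "dvanásť", "v", "nitre"]
tags = ["O", "O", "B-STREET", "I-STREET", "B-HOUSE_NUMBER", "O", "B-MUNICIPALITY"]
print(decode_bio(tokens, tags))
```

In a full pipeline, the `tags` list would come from the token-classification model's predictions rather than being written by hand; the decoding step itself stays the same.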

Paper Structure (8 sections, 6 tables)