Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages

Jimin Sohn; Haeji Jung; Alex Cheng; Jooeon Kang; Yilin Du; David R. Mortensen

TL;DR

This paper proposes a novel approach to NER using phonemic representation based on the International Phonetic Alphabet (IPA) to bridge the gap between representations of different languages.

Abstract

Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-resource languages. In this paper, we propose a novel approach to NER using phonemic representation based on the International Phonetic Alphabet (IPA) to bridge the gap between representations of different languages. Our experiments show that our method significantly outperforms baseline models in extremely low-resource languages, with the highest average F1 score (46.38%) and lowest standard deviation (12.67), and that it is particularly robust on languages with non-Latin scripts. Our code is available at https://github.com/Gabriel819/zeroshot_ner.git
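The abstract summarizes robustness with two aggregate numbers: the average F1 score across languages and the standard deviation of those scores. The following is a minimal sketch of how such a summary could be computed, not the authors' evaluation code. The function name `summarize_f1`, the language keys, and the score values are hypothetical placeholders, and the use of the population standard deviation is an assumption, since the abstract does not say which variant is reported.

```python
from statistics import mean, pstdev


def summarize_f1(scores_by_language):
    """Return (average, standard deviation) of per-language F1 scores.

    Assumption: population standard deviation (pstdev). The paper may
    use the sample standard deviation instead.
    """
    scores = list(scores_by_language.values())
    if not scores:
        raise ValueError("need at least one language score")
    return mean(scores), pstdev(scores)


# Hypothetical placeholder scores, NOT results from the paper.
example_scores = {"lang_a": 40.0, "lang_b": 50.0, "lang_c": 60.0}

avg, std = summarize_f1(example_scores)
print(f"average F1 = {avg:.2f}, std = {std:.2f}")
```

A lower standard deviation at a comparable average indicates that a model performs more evenly across languages, which is the sense in which the abstract uses it as a robustness signal.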

Paper Structure (23 sections, 5 figures, 6 tables)

Introduction
Related Work
Zero-shot Cross-lingual NER
Phonemic Representation
Our Approach
NER with Phonemes
Cross-lingual Transfer to Unseen Languages
Experiments
Benchmark Dataset
Baseline Models
Results
Zero-Shot NER on Seen Languages
Zero-Shot NER on Unseen Languages
Robustness Across Writing Systems
Conclusion
...and 8 more sections

Figures (5)

Figure 1: Zero-shot Cross-Lingual NER with IPA phonemes.
Figure 2: Distribution of F1 scores for each language set. X-axis shows each model using their first three letters, with '(gr)' and '(ph)' indicating their input forms (graphemes and phonemes, respectively). Colored horizontal lines and the numbers above show the average F1 scores for each model.
Figure 3: NER results on the target language (Sinhala) produced by each model trained on English data: (a) CANINE (b) XPhoneBERT.
Figure 4: Performance distribution of each model on languages using Latin and non-Latin scripts from unseen languages.
Figure 5: Performance distribution of each model on languages using Latin and non-Latin scripts.
