Gazetteer-Enhanced Bangla Named Entity Recognition with BanglaBERT Semantic Embeddings K-Means-Infused CRF Model

Niloy Farhan, Saman Sarker Joy, Tafseer Binte Mannan, Farig Sadeque

TL;DR

This work tackles Bangla Named Entity Recognition by combining a large, externally maintained Gazetteer with BanglaBERT semantic embeddings in a Conditional Random Field framework. The Gazetteer, containing 93,749 Bangla entities collected from Wikidata and supplementary sources and stored in a Trie, provides explicit label cues that improve sequence labeling. The authors couple Gazetteer and BanglaBERT-derived features (including K-means cluster IDs and softmax outputs) within a CRF model, achieving a Macro F1 of 0.8267 on the MultiCoNER I Bangla test set, surpassing baseline BiLSTM, Transformer, and BanglaBERT-only approaches. The result demonstrates the practical value of knowledge-based resources for low-resource Bangla NER, and the authors make the gazetteer data and code publicly available for replication and further research.
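To make the Trie-backed Gazetteer lookup concrete, here is a minimal sketch of how entity entries could be stored and matched against a tokenized sentence to produce BIO-style cues for a CRF. It is an illustration under stated assumptions, not the authors' implementation: the class and function names (`GazetteerTrie`, `gazetteer_tags`), the token-level (rather than character-level) trie, the greedy longest-match strategy, and the two sample entries are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on nodes where a complete gazetteer entry ends.
        self.label: Optional[str] = None


class GazetteerTrie:
    """Hypothetical token-level trie mapping entity token sequences to labels."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def add(self, tokens: List[str], label: str) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.label = label

    def longest_match(self, tokens: List[str], start: int) -> Tuple[int, Optional[str]]:
        """Return (length, label) of the longest entry beginning at `start`."""
        node = self.root
        best_len, best_label = 0, None
        for i in range(start, len(tokens)):
            nxt = node.children.get(tokens[i])
            if nxt is None:
                break
            node = nxt
            if node.label is not None:
                best_len, best_label = i - start + 1, node.label
        return best_len, best_label


def gazetteer_tags(tokens: List[str], trie: GazetteerTrie) -> List[str]:
    """Assign BIO tags by greedy longest match (an assumed matching strategy)."""
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        length, label = trie.longest_match(tokens, i)
        if length == 0:
            i += 1
            continue
        tags[i] = f"B-{label}"
        for j in range(i + 1, i + length):
            tags[j] = f"I-{label}"
        i += length
    return tags


# Illustrative entries only; the real gazetteer holds 93,749 entities.
trie = GazetteerTrie()
trie.add(["রবীন্দ্রনাথ", "ঠাকুর"], "PER")
trie.add(["ঢাকা"], "LOC")

sentence = ["রবীন্দ্রনাথ", "ঠাকুর", "ঢাকা", "গিয়েছিলেন"]
tags = gazetteer_tags(sentence, trie)
```

In a setup like the one described, each token's gazetteer tag could be added as one entry in that token's CRF feature dictionary, alongside the BanglaBERT-derived features such as the K-means cluster ID.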

Abstract

Named Entity Recognition (NER) is a sub-task of Natural Language Processing (NLP) that identifies entities in unstructured text and classifies them into predefined categories. In recent years, many Bangla NLP subtasks have received considerable attention, but Named Entity Recognition in Bangla still lags behind. In this research, we explore the existing state of research in Bangla Named Entity Recognition, identify the limitations that current techniques and datasets face, and aim to address these limitations. Additionally, we developed a Gazetteer that can significantly boost NER performance. We also propose a new NER solution that takes advantage of state-of-the-art NLP tools and outperforms conventional techniques.

Paper Structure (31 sections, 2 equations, 10 figures, 6 tables)

Figures (10)

  • Figure 1: NER Example in Bangla
  • Figure 2: BIO Tags
  • Figure 3: Workflow of the formation of our gazetteer
  • Figure 4: Trie Data Structure
  • Figure 5: Fine-tuned BanglaBERT Large Model
  • ...and 5 more figures