Gazetteer-Enhanced Bangla Named Entity Recognition with BanglaBERT Semantic Embeddings K-Means-Infused CRF Model

Niloy Farhan; Saman Sarker Joy; Tafseer Binte Mannan; Farig Sadeque

TL;DR

This work tackles Bangla Named Entity Recognition by combining a large, externally maintained Gazetteer with BanglaBERT semantic embeddings in a Conditional Random Field framework. The Gazetteer, containing 93,749 Bangla entities collected from Wikidata and supplementary sources and stored in a Trie, provides explicit label cues that improve sequence labeling. The authors couple Gazetteer and BanglaBERT-derived features (including K-means cluster IDs and softmax outputs) within a CRF model, achieving a Macro F1 of 0.8267 on the MultiCoNER I Bangla test set, surpassing baseline BiLSTM, Transformer, and BanglaBERT-only approaches. This demonstrates the practical value of knowledge-based resources for low-resource Bangla NER and provides publicly available gazetteer data and code for replication and further research.
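
The Trie-backed gazetteer lookup described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the token-level trie, the longest-match strategy, the BIO-style feature strings, and all names (`GazetteerTrie`, `longest_match`, `gazetteer_features`) are assumptions made for the example, and the two sample entries are illustrative rather than taken from the released gazetteer.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Entity label, set only on nodes where a complete entity ends.
        self.label: Optional[str] = None


class GazetteerTrie:
    """Hypothetical token-level trie mapping entity token sequences to labels."""

    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: List[str], label: str) -> None:
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, TrieNode())
        node.label = label

    def longest_match(self, tokens: List[str], start: int) -> Tuple[int, Optional[str]]:
        """Return (length, label) of the longest entity starting at `start`."""
        node = self.root
        best_len, best_label = 0, None
        for i in range(start, len(tokens)):
            node = node.children.get(tokens[i])
            if node is None:
                break
            if node.label is not None:
                best_len, best_label = i - start + 1, node.label
        return best_len, best_label


def gazetteer_features(trie: GazetteerTrie, tokens: List[str]) -> List[str]:
    """Tag each token with a BIO-style gazetteer cue (assumed feature format)."""
    feats = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        length, label = trie.longest_match(tokens, i)
        if length == 0:
            i += 1
            continue
        feats[i] = f"B-{label}"
        for j in range(i + 1, i + length):
            feats[j] = f"I-{label}"
        i += length
    return feats


if __name__ == "__main__":
    trie = GazetteerTrie()
    trie.insert(["রবীন্দ্রনাথ", "ঠাকুর"], "PER")  # illustrative entry
    trie.insert(["ঢাকা"], "LOC")                  # illustrative entry
    sentence = ["রবীন্দ্রনাথ", "ঠাকুর", "ঢাকা", "গিয়েছিলেন"]
    print(gazetteer_features(trie, sentence))
    # ['B-PER', 'I-PER', 'B-LOC', 'O']
```

In a setup like this, the per-token gazetteer cue would be one entry in the CRF feature dictionary, alongside the BanglaBERT-derived features. Lookup cost depends on the length of the matched span rather than on the number of gazetteer entries, which is the usual reason for choosing a trie for this kind of resource.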

Abstract

Named Entity Recognition (NER) is a sub-task of Natural Language Processing (NLP) that identifies entities in unstructured text and classifies them into predefined categories. In recent years, many Bangla NLP subtasks have received considerable attention, but Named Entity Recognition in Bangla still lags behind. In this research, we examine the existing state of research in Bangla Named Entity Recognition, identify the limitations of current techniques and datasets, and set out to address them. Additionally, we developed a Gazetteer that can significantly boost NER performance. We also propose a new NER solution that takes advantage of state-of-the-art NLP tools and outperforms conventional techniques.

Paper Structure (31 sections, 2 equations, 10 figures, 6 tables)

Introduction
Related Works
Dataset
MultiCoNER I Dataset
Gazetteers
Extracting Bangla Data from Wikidata and Scraping
Converting the English Data
Cleaning the data
Trie Data Structure
Exploratory Data Analysis
Large Test Data
Irregularities in the Punctuations
Presence of Foreign Words
Imbalanced Dataset
Alignment of tags after tokenization
...and 16 more sections

Figures (10)

Figure 1: NER Example in Bangla
Figure 2: BIO Tags
Figure 3: Workflow of the formation of our gazetteer
Figure 4: Trie Data Structure
Figure 5: Fine-tuned BanglaBERT Large Model
...and 5 more figures
