Fine-tuning Pre-trained Named Entity Recognition Models For Indian Languages

Sankalp Bahad, Pruthwik Mishra, Karunesh Arora, Rakesh Chandra Balabantaray, Dipti Misra Sharma, Parameswari Krishnamurthy

TL;DR

This study addresses NER for Indian languages by creating a 40K-sentence annotated corpus across Hindi, Odia, Urdu, and Telugu and demonstrating the viability of fine-tuning multilingual transformer models for these languages. It compares monolingual baselines (BERT) and an existing NER model (HiNER) with a multilingual approach based on XLM-RoBERTa, showing comparable or improved performance (approximately 0.75–0.83 F1 across languages) and notable cross-language transfer effects, especially for Odia and Urdu. The work emphasizes cross-lingual progressive transfer learning with vocabulary augmentation, native-script training over romanization, and provides valuable resources (datasets and evaluations) to advance NER in low-resource Indian languages. The results underscore the practical potential of transfer learning to scale NER to multiple Indian languages and domains, with implications for downstream NLP tasks and governance-related applications.

Abstract

Named Entity Recognition (NER) is a useful component in Natural Language Processing (NLP) applications. It is used in various tasks such as Machine Translation, Summarization, Information Retrieval, and Question-Answering systems. Research on NER is centered on English and a few other major languages, whereas Indian languages have received limited attention. We analyze the challenges and propose techniques that can be tailored for multilingual Named Entity Recognition for Indian languages. We present a human-annotated named entity corpus of 40K sentences for 4 Indian languages from two of the major Indian language families. Additionally, we present a multilingual model fine-tuned on our dataset, which achieves an average F1 score of 0.80 on it. We achieve comparable performance on completely unseen benchmark datasets for Indian languages, which affirms the usability of our model.
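
The abstract reports F1 scores without restating how they are computed. As a rough illustration only, the sketch below computes micro-averaged, exact-match entity-level F1 over BIO-tagged sequences, a common convention for NER evaluation. Whether the paper uses exactly this scheme is an assumption; the function names `extract_spans` and `entity_f1` and the tag labels in the example are hypothetical.

```python
from typing import List, Set, Tuple

# (entity type, start index, end index exclusive)
Span = Tuple[str, int, int]


def extract_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from one BIO-tagged sentence.

    Assumption: strict BIO, so an I- tag that does not continue a span of
    the same type is ignored rather than starting a new span.
    """
    spans: Set[Span] = set()
    start, etype = None, None
    # A trailing "O" sentinel closes any span still open at sentence end.
    for i, tag in enumerate(tags + ["O"]):
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((etype, start, i))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def entity_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1 where a prediction counts only on exact span and type match."""
    tp = n_gold = n_pred = 0
    for g, p in zip(gold, pred):
        gold_spans, pred_spans = extract_spans(g), extract_spans(p)
        tp += len(gold_spans & pred_spans)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_gold
    return 2 * precision * recall / (precision + recall)


# Toy example: the person span is found, the location span is missed.
gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "I-PER", "O", "O"]]
score = entity_f1(gold, pred)  # precision 1.0, recall 0.5
```

In the toy example, one of two gold entities is recovered with no false positives, so precision is 1.0 and recall is 0.5. A per-language score would be obtained by calling `entity_f1` on each language's test set separately.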

Paper Structure (9 sections, 1 figure, 18 tables)