ILID: Native Script Language Identification for Indian Languages
Yash Ingle, Pruthwik Mishra
TL;DR
This paper tackles language identification for Indian languages by introducing ILID, a benchmark dataset of 250K sentences across 23 languages and 25 scripts, designed to address script overlap and code-mixing. It evaluates three modelling families: TF-IDF based ML models (word and character features, combined through ensemble voting), FastText, and a fine-tuned MuRIL BERT model. Models are compared using macro F1, and the ensembles outperform both single models and MuRIL on many languages. The dataset is built with two data collection strategies (web scraping and Bhashaverse sampling) and split 80:10:10, and comprehensive corpus statistics are reported to characterize its linguistic richness. The work provides a public resource for Indian NLP, delivering robust baselines and highlighting the remaining challenges for low-resource languages and script-rich, code-mixed data.
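As a rough illustration of the evaluation setup summarised above, the sketch below combines per-sentence predictions from several classifiers by hard majority voting and scores the result with macro F1. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the tie-breaking rule, and the toy labels (`hin`, `mar`, `ben`) are hypothetical and chosen only for illustration.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Hard voting over several models' predictions.

    predictions_per_model: list of label lists, one list per model,
    all of the same length. Ties are broken in favour of the earliest
    model in the list (an assumption made for this sketch).
    """
    ensembled = []
    for votes in zip(*predictions_per_model):
        counts = Counter(votes)
        best = max(counts.values())
        # First label, in model order, that reaches the top vote count.
        ensembled.append(next(v for v in votes if counts[v] == best))
    return ensembled


def macro_f1(gold, predicted):
    """Unweighted mean of per-language F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels and hypothetical model outputs.
gold = ["hin", "mar", "hin", "ben"]
word_model = ["hin", "hin", "hin", "ben"]
char_model = ["hin", "mar", "mar", "ben"]
third_model = ["mar", "mar", "hin", "ben"]

ensemble = majority_vote([word_model, char_model, third_model])
print(ensemble, macro_f1(gold, ensemble), macro_f1(gold, word_model))
```

Because macro F1 averages per-language scores without weighting by class size, a language with few test sentences counts as much as a large one, which is why it suits a benchmark that includes low-resource languages.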
Abstract
Language identification is a crucial, fundamental step in NLP. It often serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed text. This becomes even harder in the case of the diverse Indian languages, which exhibit lexical and phonetic similarities yet remain distinct from one another. Many Indian languages also share the same script, making the task even more challenging. Taking all these challenges into account, we develop and release a dataset of 250K sentences covering 23 languages, namely English and all 22 official Indian languages, each sentence labeled with its language identifier; the data for most languages are newly created. We also develop and release baseline models using state-of-the-art machine learning approaches and fine-tuned pre-trained transformer models. Our models outperform the state-of-the-art pre-trained transformer models on the language identification task. The dataset and code are available at https://yashingle-ai.github.io/ILID/ and on Hugging Face.
