L3Cube-MahaSocialNER: A Social Media based Marathi NER Dataset and BERT models
Harsh Chaudhari, Anuja Patil, Dhanashree Lavekar, Pranav Khairnar, Raviraj Joshi
TL;DR
This paper addresses the lack of robust Marathi social-media NER data by introducing L3Cube-MahaSocialNER, the first and largest Marathi social-media NER dataset, with 18,000 manually labeled sentences across eight entity classes in both IOB and non-IOB formats. It evaluates CNN, LSTM, BiLSTM, and a broad set of Transformer models, including MahaNER-BERT and other Marathi pre-trained variants, and shows that fine-tuning existing regular-NER models on the social-NER data yields the best performance. A key finding is that regular (non-social) NER models perform poorly in the zero-shot setting on social text, underscoring the need for domain-specific data; MahaNER-BERT achieves the top F1 scores in both the IOB (84.06) and non-IOB (88.23) settings. The dataset and models are publicly available, enabling real-time, user-centric information extraction for public opinion analysis, news, and marketing in Marathi, and supporting cross-domain analysis with existing MahaNER resources.
Abstract
This work introduces the L3Cube-MahaSocialNER dataset, the first and largest social media dataset specifically designed for Named Entity Recognition (NER) in the Marathi language. The dataset comprises 18,000 manually labeled sentences covering eight entity classes, addressing challenges posed by social media data, including non-standard language and informal idioms. Deep learning models, including CNN, LSTM, BiLSTM, and Transformer models, are evaluated on this dataset with both IOB and non-IOB notations. The results demonstrate the effectiveness of these models in accurately recognizing named entities in informal Marathi text. The L3Cube-MahaSocialNER dataset enables user-centric information extraction and supports real-time applications, providing a valuable resource for public opinion analysis, news, and marketing on social media platforms. We also show that the zero-shot results of the regular NER model are poor on the social NER test set, thus highlighting the need for more social NER datasets. The datasets and models are publicly available at https://github.com/l3cube-pune/MarathiNLP
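To illustrate the difference between the two notations mentioned above, the following is a minimal sketch of how IOB tags can be collapsed into non-IOB tags by dropping the B-/I- position prefix. The tag names (PER, LOC), the example tokens, and the helper `iob_to_non_iob` are hypothetical and chosen only for illustration; they are not the label set or tooling released with the dataset.

```python
def iob_to_non_iob(tags):
    """Drop the B-/I- position prefix from each IOB tag, keeping only the class label.

    Tags without such a prefix (e.g. "O") are returned unchanged.
    """
    return [tag.split("-", 1)[1] if tag[:2] in ("B-", "I-") else tag for tag in tags]


# Hypothetical example sentence and tags (not taken from the dataset).
tokens = ["Sachin", "Tendulkar", "lives", "in", "Mumbai"]
iob_tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]

non_iob_tags = iob_to_non_iob(iob_tags)
for token, iob, non_iob in zip(tokens, iob_tags, non_iob_tags):
    print(f"{token}\t{iob}\t{non_iob}")
```

In the non-IOB form every token of an entity carries the same class label, so the boundary between two adjacent entities of the same class is no longer marked; this is one plausible reason why scores in the two settings are not directly comparable.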
