L3Cube-IndicNews: News-based Short Text and Long Document Classification Datasets in Indic Languages

Aishwarya Mirashi, Srushti Sonavane, Purva Lingayat, Tejas Padhiyar, Raviraj Joshi

TL;DR

The paper addresses the scarcity of labeled news classification datasets for Indic languages by introducing L3Cube-IndicNews, a multilingual corpus spanning ten Indic languages with three length-specific datasets: Short Headlines Classification (SHC), Long Document Classification (LDC), and Long Paragraph Classification (LPC). It describes data collection from diverse news sources, preprocessing that preserves Indic scripts, and train/validation/test splits, with over 26,000 records per sub-dataset across 10–12 categories. The study benchmarks four families of BERT-based models (monolingual BERTs, monolingual SBERTs, the multilingual IndicSBERT, and IndicBERT) on SHC, LDC, and LPC, finding that long documents (LDC) typically yield higher accuracy and that BengaliBERT performs strongly on LDC. The corpus supports length-based analysis and cross-lingual studies and sets a benchmark for future topic classification research in Indian languages. The datasets and models are publicly available.

Abstract

In this work, we introduce L3Cube-IndicNews, a multilingual text classification corpus aimed at curating a high-quality dataset for Indian regional languages, with a specific focus on news headlines and articles. We have centered our work on 10 prominent Indic languages, including Hindi, Bengali, Marathi, Telugu, Tamil, Gujarati, Kannada, Odia, Malayalam, and Punjabi. Each of these news datasets comprises 10 or more classes of news articles. L3Cube-IndicNews offers 3 distinct datasets tailored to handle different document lengths that are classified as: Short Headlines Classification (SHC) dataset containing the news headline and news category, Long Document Classification (LDC) dataset containing the whole news article and the news category, and Long Paragraph Classification (LPC) containing sub-articles of the news and the news category. We maintain consistent labeling across all 3 datasets for in-depth length-based analysis. We evaluate each of these Indic language datasets using 4 different models including monolingual BERT, multilingual Indic Sentence BERT (IndicSBERT), and IndicBERT. This research contributes significantly to expanding the pool of available text classification datasets and also makes it possible to develop topic classification models for Indian regional languages. This also serves as an excellent resource for cross-lingual analysis owing to the high overlap of labels among languages. The datasets and models are shared publicly at https://github.com/l3cube-pune/indic-nlp
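To make the relationship between the three datasets concrete, the following is a minimal sketch of how one labeled news item could yield an SHC, an LDC, and several LPC records that all share the same category. The names `Record` and `build_records` are hypothetical, the sample text is placeholder data, and splitting the article on blank lines is only an assumption: the abstract says LPC contains sub-articles but does not state how they are cut.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """One classification example: a piece of text and its news category."""
    text: str
    label: str


def build_records(headline, article, label, paragraph_sep="\n\n"):
    """Derive SHC, LDC, and LPC records from a single labeled news item.

    Assumption: sub-articles are approximated by splitting on blank lines.
    """
    # SHC: the headline alone.
    shc = [Record(headline.strip(), label)]
    # LDC: the whole article.
    ldc = [Record(article.strip(), label)]
    # LPC: each non-empty paragraph, carrying the same label as the article.
    lpc = [
        Record(part.strip(), label)
        for part in article.split(paragraph_sep)
        if part.strip()
    ]
    return {"SHC": shc, "LDC": ldc, "LPC": lpc}


if __name__ == "__main__":
    # Placeholder text, not taken from the corpus.
    item = build_records(
        headline="Placeholder headline",
        article="First placeholder paragraph.\n\nSecond placeholder paragraph.",
        label="sports",
    )
    for name, records in item.items():
        print(name, len(records), {r.label for r in records})
```

Because every record derived from an item keeps that item's label, the same article can be compared across the three input lengths, which is what the consistent labeling is meant to enable.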

Paper Structure (17 sections, 6 figures, 6 tables)

Figures (6)

  • Figure 1: SHC Dataset Overview
  • Figure 2: LPC Dataset Overview
  • Figure 3: LDC Dataset Overview
  • Figure 4: Confusion matrix for the Kannada SHC dataset
  • Figure 5: Confusion matrix for the Kannada LDC dataset
  • ...and 1 more figure