NSINA: A News Corpus for Sinhala

Hansi Hettiarachchi, Damith Premasiri, Lasitha Uyangodage, Tharindu Ranasinghe

TL;DR

NSina addresses data scarcity in Sinhala NLP by introducing a large-scale news corpus of 506,932 articles from ten Sri Lankan outlets, together with three evaluation tasks: news media identification, news category prediction, and news headline generation. The authors benchmark several transformers (SinBERT, XLM-R variants, mBART, mT5) and find strong performance on the two classification tasks (Macro F1 around 0.88–0.94) but only limited gains on headline generation, owing to the lack of Sinhala-specific generation models and evaluation metrics. By releasing NSina and its tasks, the work enables robust benchmarking for Sinhala NLP, highlights the value of multilingual models, and exposes the ongoing challenges of language generation for low-resource languages.
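
For readers unfamiliar with the Macro F1 metric quoted above, the following is a minimal sketch of how it is commonly computed: the F1 score is calculated separately for each class and the per-class scores are averaged with equal weight, so small classes count as much as large ones. The function name `macro_f1` and the labels `outlet_a` and `outlet_b` are hypothetical placeholders for illustration; they are not taken from the paper or its released code.

```python
from collections import defaultdict


def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    # Consider every label that occurs in either the gold or predicted labels.
    labels = sorted(set(y_true) | set(y_pred))
    tp = defaultdict(int)  # true positives per class
    fp = defaultdict(int)  # false positives per class
    fn = defaultdict(int)  # false negatives per class

    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted class gains a false positive
            fn[gold] += 1  # gold class gains a false negative

    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, standing in for news outlets.
gold = ["outlet_a", "outlet_a", "outlet_b", "outlet_b"]
pred = ["outlet_a", "outlet_b", "outlet_b", "outlet_b"]
score = macro_f1(gold, pred)
print(f"Macro F1: {score:.4f}")
```

In this toy case the per-class F1 scores are 2/3 for `outlet_a` and 4/5 for `outlet_b`, so the macro average is their plain mean rather than a mean weighted by class size.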

Abstract

The introduction of large language models (LLMs) has advanced natural language processing (NLP), but their effectiveness depends largely on pre-training resources. This is especially evident in low-resource languages, such as Sinhala, which face two primary challenges: the lack of substantial training data and limited benchmarking datasets. In response, this study introduces NSINA, a comprehensive news corpus of over 500,000 articles from popular Sinhala news websites, along with three NLP tasks: news media identification, news category prediction, and news headline generation. The release of NSINA aims to address the challenges of adapting LLMs to Sinhala, offering valuable resources and benchmarks for improving NLP in the Sinhala language. NSINA is the largest news corpus for Sinhala available to date.

Paper Structure (22 sections, 2 figures, 4 tables)

Figures (2)

  • Figure 1: Token frequency distribution of news content in NSina
  • Figure 2: Token frequency distribution of news headline in NSina