Exploring News Summarization and Enrichment in a Highly Resource-Scarce Indian Language: A Case Study of Mizo

Abhinaba Bala, Ashok Urlana, Rahul Mishra, Parameswari Krishnamurthy

TL;DR

This study tackles information scarcity in the Mizo language by proposing a cross-lingual enrichment pipeline that augments Mizo news with relevant English-language information. The workflow translates Mizo articles to English, generates headlines, retrieves and summarizes related English documents with PEGASUS, and translates the enriched results back to Mizo, producing 500 enriched articles. Human evaluation shows strong coherence and readability, with moderate enrichment and relevancy, indicating improved information coverage for Mizo news. The work demonstrates a practical approach to extending content in underrepresented languages and provides data and code to support further research in low-resource NLP.

Abstract

Obtaining sufficient information in one's mother tongue is crucial for satisfying the information needs of the users. While high-resource languages have abundant online resources, the situation is less than ideal for very low-resource languages. Moreover, the insufficient reporting of vital national and international events continues to be a worry, especially in languages with scarce resources, like Mizo. In this paper, we conduct a study to investigate the effectiveness of a simple methodology designed to generate a holistic summary for Mizo news articles, which leverages English-language news to supplement and enhance the information related to the corresponding news events. Furthermore, we make available 500 Mizo news articles and corresponding enriched holistic summaries. Human evaluation confirms that our approach significantly enhances the information coverage of Mizo news articles. The Mizo dataset and code can be accessed at https://github.com/barvin04/mizo_enrichment

Paper Structure (18 sections, 1 figure, 5 tables)

This paper contains 18 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: The Enrichment Methodology Pipeline. The illustration outlines the sequential stages of the methodology: (a) data collection, (b) preprocessing and transformation/translation, (c) headline generation, (d) multi-document summarization, and (e) translation into the low-resource language. Together, these stages enrich articles in low-resource languages and make their content more comprehensive and accessible.
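
The stages named in the Figure 1 caption can be sketched as a simple orchestration in Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the class `EnrichmentPipeline`, its field names, and the `demo_pipeline` helper are hypothetical, and every stage is a placeholder callable rather than a real translation, retrieval, or PEGASUS model. Passing the translated article to the summarizer together with the retrieved documents is also an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EnrichmentPipeline:
    """Hypothetical wiring of the pipeline stages; each stage is injected as a callable."""

    translate_to_english: Callable[[str], str]       # Mizo -> English translation
    generate_headline: Callable[[str], str]          # headline generation
    retrieve_related: Callable[[str], List[str]]     # fetch related English documents
    summarize_documents: Callable[[List[str]], str]  # multi-document summarization
    translate_to_mizo: Callable[[str], str]          # English -> Mizo translation

    def enrich(self, mizo_article: str) -> str:
        english_article = self.translate_to_english(mizo_article)
        headline = self.generate_headline(english_article)
        related = self.retrieve_related(headline)
        # Assumption: the translated article is summarized together with
        # the retrieved documents to form the holistic summary.
        summary = self.summarize_documents([english_article, *related])
        return self.translate_to_mizo(summary)


def demo_pipeline() -> EnrichmentPipeline:
    """Build a pipeline from trivial stand-in stages (no real models involved)."""
    return EnrichmentPipeline(
        translate_to_english=lambda text: f"en({text})",
        generate_headline=lambda text: f"headline[{text}]",
        retrieve_related=lambda query: [f"doc1 about {query}", f"doc2 about {query}"],
        summarize_documents=lambda docs: " | ".join(docs),
        translate_to_mizo=lambda text: f"mizo({text})",
    )


if __name__ == "__main__":
    print(demo_pipeline().enrich("x"))
```

Injecting the stages as callables keeps the sketch independent of any particular model or search backend; in practice each placeholder would be replaced by whichever translation system, retriever, and summarizer are actually used.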