BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion

Md Osama, Ashim Dey, Kawsar Ahmed, Muhammad Ashad Kabir

TL;DR

The paper addresses Bengali religious news headline generation, where traditional content-only approaches lack contextual cues. It introduces BeliN, a corpus of 2,520 Bengali religious news samples labeled with category, aspect, and sentiment, and proposes MultiGen, a multi-input headline generator that fuses article content with these contextual features using transformer-based models such as BanglaT5, mT5, mT0, and mBART. In the reported experiments, MultiGen consistently surpasses content-only baselines, with BanglaT5 achieving the best overall gains (e.g., BLEU 18.61 and ROUGE-L 24.19, versus 16.08 and 23.08 for the content-only baseline). The work shows that incorporating category, aspect, and sentiment improves headline quality in a low-resource language, and it provides a publicly available dataset and code as a foundation for further NLP research in Bengali and similar languages.

Abstract

Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect. This limitation significantly hinders their effectiveness and overall performance. This study addresses this limitation by introducing a novel corpus, BeliN (Bengali Religious News) - comprising religious news articles from prominent Bangladeshi online newspapers, and MultiGen - a contextual multi-input feature fusion headline generation approach. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates additional contextual features - including category, aspect, and sentiment - with the news content. This fusion enables the model to capture critical contextual information often overlooked by traditional methods. Experimental results demonstrate the superiority of MultiGen over the baseline approach that uses only news content, achieving a BLEU score of 18.61 and ROUGE-L score of 24.19, compared to baseline approach scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.
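The abstract describes MultiGen as fusing category, aspect, and sentiment with the news content before headline generation, but it does not specify the fusion format. The sketch below shows one plausible way to build such a multi-input string for a text-to-text model, next to the content-only baseline input. The `NewsSample` record, its field names, the `key: value` labeling, the ` | ` separator, and the example values are all assumptions made for illustration; they are not taken from the BeliN schema or the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class NewsSample:
    """One corpus record. Field names are hypothetical, not the BeliN schema."""
    content: str
    category: str
    aspect: str
    sentiment: str


def fuse_inputs(sample: NewsSample, sep: str = " | ") -> str:
    """Build a multi-input string: labeled contextual features, then the article.

    Assumption: features are fused as plain text ahead of the content. The
    paper may fuse them differently (e.g. special tokens or separate encoders).
    """
    parts = [
        f"category: {sample.category}",
        f"aspect: {sample.aspect}",
        f"sentiment: {sample.sentiment}",
        f"content: {sample.content}",
    ]
    return sep.join(parts)


def content_only(sample: NewsSample) -> str:
    """Baseline input: the article content alone, with no contextual features."""
    return f"content: {sample.content}"


# Hypothetical example values, for illustration only.
sample = NewsSample(
    content="Example article text.",
    category="festival",
    aspect="observance",
    sentiment="positive",
)

print(fuse_inputs(sample))
print(content_only(sample))
```

Under these assumptions, either string could be tokenized and passed to a sequence-to-sequence model such as BanglaT5 or mT5, so the only difference between the baseline and the multi-input setting is the presence of the three contextual fields in the source text.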

Paper Structure

This paper contains 32 sections, 15 equations, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: An overview of multi-input headline generation.
  • Figure 2: An overview of BeliN corpus development process.
  • Figure 3: Distribution of article and headline lengths in the dataset. (a) Frequency of article lengths (in words), illustrating the common word counts for articles. (b) Frequency of headline lengths (in words), highlighting the typical brevity or elaboration of headlines compared to the full articles.
  • Figure 4: MultiGen architecture for headline generation.