BeliN: A Novel Corpus for Bengali Religious News Headline Generation using Contextual Feature Fusion
Md Osama, Ashim Dey, Kawsar Ahmed, Muhammad Ashad Kabir
TL;DR
The paper tackles Bengali religious news headline generation, where traditional approaches rely on article content alone and miss useful contextual cues. It introduces BeliN, a 2,520-sample Bengali religious news corpus labeled with category, aspect, and sentiment, and proposes MultiGen, a multi-input headline generator that fuses article content with these contextual features using transformer-based models such as BanglaT5, mT5, mT0, and mBART. Empirical results show that MultiGen consistently surpasses content-only baselines, with BanglaT5 achieving the best overall gains (BLEU 18.61 and ROUGE-L 24.19, against 16.08 and 23.08 for the content-only baseline). The work demonstrates that incorporating category, aspect, and sentiment meaningfully improves headline quality in a low-resource language, and it offers a reproducible resource and a foundation for further NLP research in Bengali and similar languages.
Abstract
Automatic text summarization, particularly headline generation, remains a critical yet underexplored area for Bengali religious news. Existing approaches to headline generation typically rely solely on the article content, overlooking crucial contextual features such as sentiment, category, and aspect, which limits their effectiveness. This study addresses that limitation with two contributions: BeliN (Bengali Religious News), a novel corpus of religious news articles from prominent Bangladeshi online newspapers, and MultiGen, a contextual multi-input feature fusion approach to headline generation. Leveraging transformer-based pre-trained language models such as BanglaT5, mBART, mT5, and mT0, MultiGen integrates the category, aspect, and sentiment of each article with its news content. This fusion enables the model to capture contextual information that content-only methods overlook. Experimental results show that MultiGen outperforms the baseline approach that uses only news content, achieving a BLEU score of 18.61 and a ROUGE-L score of 24.19, compared to baseline scores of 16.08 and 23.08, respectively. These findings underscore the importance of incorporating contextual features in headline generation for low-resource languages. By bridging linguistic and cultural gaps, this research advances natural language processing for Bengali and other underrepresented languages. To promote reproducibility and further exploration, the dataset and implementation code are publicly accessible at https://github.com/akabircs/BeliN.
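
To make the idea of multi-input fusion concrete, the following is a minimal sketch of one way the contextual features could be combined with the article content before being passed to a sequence-to-sequence model. The abstract does not specify the fusion format, so the field tags, the separator, the truncation limit, and the names `NewsSample`, `fuse_inputs`, and `content_only` are all assumptions made for illustration, as are the example label values.

```python
from dataclasses import dataclass


@dataclass
class NewsSample:
    """One corpus entry (hypothetical layout, not the actual BeliN schema)."""
    content: str
    category: str
    aspect: str
    sentiment: str


def fuse_inputs(sample: NewsSample, sep: str = " | ",
                max_content_chars: int = 2000) -> str:
    """Concatenate contextual features and content into one input string.

    Assumption: features are fused as tagged text fields placed before the
    content, so they survive if the tokenizer later truncates the sequence.
    """
    fields = [
        ("category", sample.category),
        ("aspect", sample.aspect),
        ("sentiment", sample.sentiment),
        ("content", sample.content[:max_content_chars]),
    ]
    return sep.join(f"{name}: {value.strip()}" for name, value in fields)


def content_only(sample: NewsSample, max_content_chars: int = 2000) -> str:
    """Baseline-style input that uses the article content alone."""
    return f"content: {sample.content[:max_content_chars].strip()}"


if __name__ == "__main__":
    # Placeholder values; the real label sets are defined by the corpus.
    example = NewsSample(
        content=" Article text goes here. ",
        category="example-category",
        aspect="example-aspect",
        sentiment="positive",
    )
    print(fuse_inputs(example))
    print(content_only(example))
```

In such a setup, the fused string would be tokenized and fed to a pre-trained model such as BanglaT5 for fine-tuning, while the content-only string would serve as the baseline input; the tokenization and training steps are omitted here.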
