Tackling Fake News in Bengali: Unraveling the Impact of Summarization vs. Augmentation on Pre-trained Language Models

Arman Sakif Chowdhury, G. M. Shahariar, Ahammed Tarik Aziz, Syed Mohibul Alam, Md. Azad Sheikh, Tanveer Ahmed Belal

TL;DR

This work tackles Bengali fake-news detection by addressing data scarcity and long-text inputs through a four-approach framework that combines summarization and augmentation with five pre-trained transformers. It builds training data by merging BanFakeNews with translated English fake news (TransFND), introduces a controlled augmented training set to balance classes, and adds a summarization pipeline to cope with the 512-token input limit. Evaluations on three test sets show high accuracy: BanglaBERT-based models perform best when augmentation and summarization are used, and mBERT-based variants generalize best to unseen data. The datasets and code are publicly available. The results indicate that summarization and augmentation can improve Bengali fake-news detection and generalization, with implications for low-resource languages and cross-dataset robustness.

Abstract

With the rise of social media and online news sources, fake news has become a significant issue globally. However, the detection of fake news in low-resource languages like Bengali has received limited research attention. In this paper, we propose a methodology consisting of four distinct approaches to classify fake news articles in Bengali using summarization and augmentation techniques with five pre-trained language models. Our approach includes translating English news articles and applying augmentation techniques to offset the shortage of fake news articles. We also summarize the news articles to work around the token-length limitation of BERT-based models. Through extensive experimentation and evaluation, we show the effectiveness of summarization and augmentation for Bengali fake news detection. We evaluated our models on three separate test datasets. The BanglaBERT Base model, when combined with augmentation techniques, achieved an accuracy of 96% on the first test dataset. On the second test dataset, the BanglaBERT model trained with summarized augmented news articles achieved 97% accuracy. Lastly, the mBERT Base model achieved an accuracy of 86% on the third test dataset, which was reserved for evaluating generalization performance. The datasets and implementations are available at https://github.com/arman-sakif/Bengali-Fake-News-Detection

Paper Structure

This paper contains 32 sections, 5 equations, 8 figures, and 14 tables.

Figures (8)

  • Figure 1: Pipeline of the summarization process.
  • Figure 2: Schematic diagram of the proposed approaches.
  • Figure 3: Pipeline of the augmentation process.
  • Figure 4: Pipeline of the summarization process.
  • Figure 5: Training and validation loss vs Accuracy for TM-mBERT.
  • ...and 3 more figures