Low-Resource Counterspeech Generation for Indic Languages: The Case of Bengali and Hindi
Mithun Das, Saurabh Kumar Pandey, Shivansh Sethi, Punyajoy Saha, Animesh Mukherjee
TL;DR
This work tackles counterspeech generation for low-resource Indic languages by building a Bengali–Hindi benchmark of 5,062 abusive speech/counterspeech (AS-CS) pairs and systematically evaluating monolingual and cross-lingual generation with transformer-based models. It introduces seed data, annotation guidelines, and a dataset creation workflow, then compares monolingual, joint, and synthetic transfer strategies using models such as BanglaT5, mT5-base, BLOOM, GPT-2 variants, and ChatGPT. Across experiments, monolingual training generally yields the strongest results, while synthetic transfer is most effective between languages of the same family (Bengali–Hindi). Zero-shot performance lags behind models fine-tuned on gold data, though large language models can contribute diverse outputs. The study also includes a post-editing evaluation and discusses ethical considerations, including privacy, biases, and potential harms. The dataset is released for research to spur further advances in counterspeech for low-resource languages.
Abstract
With the rise of online abuse, the NLP community has begun investigating the use of neural architectures to generate counterspeech that can "counter" the vicious tone of abusive speech and dilute its ripple effect across the social network. However, most efforts so far have focused on English. To bridge the gap for low-resource languages such as Bengali and Hindi, we create a benchmark dataset of 5,062 abusive speech/counterspeech pairs, of which 2,460 pairs are in Bengali and 2,602 pairs are in Hindi. To set up an effective benchmark, we implement several baseline models that generate counterspeech under various interlingual transfer mechanisms and configurations. We observe that the monolingual setup yields the best performance. Further, using synthetic transfer, language models can generate counterspeech to some extent; specifically, we notice that transferability is better when languages belong to the same language family.
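As an illustration of the training setups compared here (monolingual, joint, and synthetic transfer), the following is a minimal sketch of how the training pool for each setup might be assembled from AS-CS pairs. It rests on assumptions rather than on the released code: the `Pair` schema, the `build_training_sets` helper, and the `translate` callable are hypothetical names, and "synthetic transfer" is read here as training on machine-translated pairs from the other language, which is one plausible interpretation and not confirmed by the text above.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Pair:
    """One abusive speech / counterspeech pair (hypothetical schema)."""
    lang: str      # language code, e.g. "bn" (Bengali) or "hi" (Hindi)
    abusive: str   # abusive speech text
    counter: str   # counterspeech text


def build_training_sets(
    pairs: List[Pair],
    target: str,
    translate: Callable[[str, str, str], str],
) -> Dict[str, List[Pair]]:
    """Return the training pool for each setup when the test language is `target`.

    `translate(text, src, tgt)` is a placeholder for any machine translation
    system; it is an assumption of this sketch, not a real library call.
    """
    gold_target = [p for p in pairs if p.lang == target]
    gold_other = [p for p in pairs if p.lang != target]

    # Assumed reading of "synthetic transfer": translate the other language's
    # gold pairs into the target language and train on those alone.
    synthetic = [
        Pair(
            target,
            translate(p.abusive, p.lang, target),
            translate(p.counter, p.lang, target),
        )
        for p in gold_other
    ]

    return {
        "monolingual": gold_target,               # gold data, target language only
        "joint": gold_target + gold_other,        # gold data, both languages
        "synthetic_transfer": synthetic,          # translated data only
    }


def fake_translate(text: str, src: str, tgt: str) -> str:
    """Stand-in translator that only tags the text; replace with a real MT system."""
    return f"[{src}->{tgt}] {text}"


# Toy data, invented purely to exercise the helper.
toy = [
    Pair("bn", "abuse-1", "reply-1"),
    Pair("hi", "abuse-2", "reply-2"),
    Pair("hi", "abuse-3", "reply-3"),
]
sets = build_training_sets(toy, "bn", fake_translate)
for name, pool in sets.items():
    print(name, len(pool))
```

Each pool would then be used to fine-tune a generation model and evaluated on held-out gold pairs in the target language; how the data is split and which model is used are left open in this sketch.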
