Verifying the Robustness of Automatic Credibility Assessment
Piotr Przybyła, Alexander Shvets, Horacio Saggion
TL;DR
This paper addresses the vulnerability of automatic credibility assessment to adversarial text modifications by introducing BODEGA, a benchmark that simulates realistic moderation scenarios across four misinformation tasks. It evaluates multiple attack methods against four victim architectures (BiLSTM, BERT, GEMMA2B, GEMMA7B) under grey-box conditions, using a multi-faceted measure (BODEGA score, Sem score, and Char score) to quantify both attack success and preservation of meaning. The results show that larger language models are not inherently more robust to adversarial examples, with some attacks achieving high success rates while maintaining semantic similarity, especially in longer texts; conversely, certain shorter tasks remain harder to attack. The study highlights the importance of comprehensive robustness testing, proposes practical mitigations (e.g., human-in-the-loop, adversarial training), and provides an open framework for ongoing evaluation and methodological development in credibility assessment. Overall, BODEGA offers a principled, extensible means to benchmark and improve the reliability of automated content credibility classifiers in adversarial settings.
Abstract
Text classification methods have been widely investigated as a way to detect content of low credibility: fake news, social media bots, propaganda, etc. Quite accurate models (likely based on deep neural networks) help in moderating public electronic platforms and often cause content creators to face rejection of their submissions or removal of already published texts. Having the incentive to evade further detection, content creators try to come up with a slightly modified version of the text (known as an attack with an adversarial example) that exploits the weaknesses of classifiers and results in a different output. Here we systematically test the robustness of common text classifiers against available attacking techniques and discover that, indeed, meaning-preserving changes in input text can mislead the models. The approaches we test focus on finding vulnerable spans in text and replacing individual characters or words, taking into account the similarity between the original and replacement content. We also introduce BODEGA: a benchmark for testing both victim models and attack methods on four misinformation detection tasks in an evaluation framework designed to simulate real use-cases of content moderation. The attacked tasks include (1) fact checking and detection of (2) hyperpartisan news, (3) propaganda and (4) rumours. Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions; for example, attacks on GEMMA are up to 27% more successful than those on BERT. Finally, we manually analyse a subset of adversarial examples and check what kinds of modifications are used in successful attacks.
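
As a rough illustration of the character-level modifications and similarity checks mentioned above, the following Python sketch applies a small perturbation to a word and measures how close the result stays to the original. The function names, the adjacent-character swap, and the normalised Levenshtein similarity are all assumptions made for illustration; they are not taken from the benchmark's implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_similarity(original: str, modified: str) -> float:
    """Assumed similarity in [0, 1]: 1 minus the length-normalised edit distance."""
    longest = max(len(original), len(modified))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(original, modified) / longest


def swap_adjacent(text: str, index: int) -> str:
    """Hypothetical perturbation: swap the characters at index and index + 1."""
    if index < 0 or index + 1 >= len(text):
        return text  # nothing to swap at this position
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


if __name__ == "__main__":
    original = "vaccine"
    modified = swap_adjacent(original, 0)
    print(modified, round(char_similarity(original, modified), 3))
```

In a sketch like this, an attack would keep a candidate modification only if it changes the classifier's decision while the similarity stays high; the classifier itself is left out here.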
