Verifying the Robustness of Automatic Credibility Assessment

Piotr Przybyła; Alexander Shvets; Horacio Saggion

TL;DR

This paper addresses the vulnerability of automatic credibility assessment to adversarial text modifications by introducing BODEGA, a benchmark that simulates realistic moderation scenarios across four misinformation tasks. It evaluates multiple attack methods against four victim architectures (BiLSTM, BERT, GEMMA2B, GEMMA7B) under grey-box conditions, using a multi-faceted score (BODEGA score, semantic score, and character score) to quantify both attack success and preservation of meaning. The results show that larger language models are not inherently more robust to adversarial examples, with some attacks achieving high success rates while maintaining semantic similarity, especially in longer texts; conversely, certain shorter tasks remain harder to attack. The study highlights the importance of comprehensive robustness testing, proposes practical mitigations (e.g., human-in-the-loop, adversarial training), and provides an open framework for ongoing evaluation and methodological development in credibility assessment. Overall, BODEGA offers a principled, extensible means to benchmark and improve the reliability of automated content credibility classifiers in adversarial settings.

Abstract

Text classification methods have been widely investigated as a way to detect content of low credibility: fake news, social media bots, propaganda, etc. Quite accurate models (likely based on deep neural networks) help in moderating public electronic platforms and often cause content creators to face rejection of their submissions or removal of already published texts. Having the incentive to evade further detection, content creators try to come up with a slightly modified version of the text (known as an attack with an adversarial example) that exploits the weaknesses of classifiers and results in a different output. Here we systematically test the robustness of common text classifiers against available attacking techniques and discover that, indeed, meaning-preserving changes in input text can mislead the models. The approaches we test focus on finding vulnerable spans in text and replacing individual characters or words, taking into account the similarity between the original and replacement content. We also introduce BODEGA: a benchmark for testing both victim models and attack methods on four misinformation detection tasks in an evaluation framework designed to simulate real use-cases of content moderation. The attacked tasks include (1) fact checking and detection of (2) hyperpartisan news, (3) propaganda and (4) rumours. Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions, e.g. attacks on GEMMA being up to 27% more successful than those on BERT. Finally, we manually analyse a subset of adversarial examples and check what kinds of modifications are used in successful attacks.
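To make the idea of replacing individual words in vulnerable positions concrete, the following is a minimal sketch and not any of the attack methods evaluated in the paper. The keyword-based `toy_victim`, the `single_substitution_attack` function and the substitute dictionary are all hypothetical names and data introduced only for illustration; real attacks use trained classifiers and similarity-aware candidate selection.

```python
from typing import Callable, Dict, List, Optional


def toy_victim(text: str) -> int:
    """Hypothetical stand-in classifier: 1 = non-credible, 0 = credible."""
    trigger_words = {"shocking", "hoax", "miracle"}
    tokens = [w.strip(".,!?").lower() for w in text.split()]
    return int(any(t in trigger_words for t in tokens))


def single_substitution_attack(
    text: str,
    victim: Callable[[str], int],
    substitutes: Dict[str, List[str]],
) -> Optional[str]:
    """Try one word replacement at a time; return the first variant
    that changes the victim's decision, or None if none does."""
    original_label = victim(text)
    words = text.split()
    for i, word in enumerate(words):
        # Assumed candidate source: a small hand-written dictionary.
        for candidate in substitutes.get(word.lower(), []):
            trial = " ".join(words[:i] + [candidate] + words[i + 1:])
            if victim(trial) != original_label:
                return trial
    return None


adversarial = single_substitution_attack(
    "A shocking discovery", toy_victim, {"shocking": ["surprising"]}
)
print(adversarial)
```

Each call to `victim` inside the loop corresponds to one query against the attacked model, which is why the number of queries matters when comparing attack methods.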

Paper Structure (33 sections, 2 equations, 3 figures, 11 tables)

Introduction
Related work
Adversarial examples in NLP
Robustness of credibility assessment
Resources for adversarial examples
Adversarial example generation
BODEGA tasks
HN: Hyperpartisan news
PR: Propaganda recognition
FC: Fact checking
RD: Rumour detection
Attack scenario
Evaluation
Semantic score
Character score
...and 18 more sections

Figures (3)

Figure 1: An overview of the evaluation of an adversarial attack using BODEGA. For each task, three datasets are available: development ($X_\text{dev}$), training ($X_\text{train}$) and attack ($X_\text{attack}$). During the evaluation of an attack involving Attacker and Victim models from the library of available models, the Attacker takes the text of the $i$-th instance from the attack dataset ($x_i$), e.g. a news piece, and modifies it into an adversarial example ($x_i^*$). The Victim model is used to assess the credibility of both the original ($f(x_i)$) and modified text ($f(x_i^*)$). The BODEGA score assesses the quality of an adversarial example (AE), checking the similarity between the original and modified sample ($\text{sim}(x_i,x_i^*)$), as well as the change in the victim's output ($\text{diff}(f(x_i),f(x_i^*))$).
Figure 2: Classification performance (F1 score) and vulnerability to targeted attacks (BODEGA score) of models according to their size (parameter count, logarithmic scale), for different tasks.
Figure 3: Results of the targeted attacks (y axis, BODEGA score) plotted against the number of queries necessary (x axis, logarithmic) for various attack methods (symbols) and tasks (colours).
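
The evaluation loop described in the caption of Figure 1 can be sketched roughly as follows. This is an illustrative approximation, not the BODEGA implementation: the function names are hypothetical, `sim` is replaced by a simple character-level ratio from Python's `difflib`, `diff` is a 0/1 indicator of a changed decision, and combining the two by multiplication and averaging over the attack set is an assumption made here for the sake of the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List


def sim(x: str, x_adv: str) -> float:
    """Assumed similarity: character-level ratio in [0, 1]."""
    return SequenceMatcher(None, x, x_adv).ratio()


def diff(y: int, y_adv: int) -> float:
    """Assumed output change: 1.0 if the victim's decision flipped."""
    return 1.0 if y != y_adv else 0.0


def evaluate_attack(
    attack_set: List[str],
    attacker: Callable[[str], str],
    victim: Callable[[str], int],
) -> float:
    """Average per-example score over the attack dataset."""
    scores = []
    for x in attack_set:
        x_adv = attacker(x)  # adversarial example x_i^*
        # Assumption: reward a flipped decision, scaled by similarity.
        scores.append(diff(victim(x), victim(x_adv)) * sim(x, x_adv))
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy victim and attacker, for demonstration only.
toy_victim = lambda text: int("hoax" in text)
toy_attacker = lambda text: text.replace("hoax", "h0ax")

score = evaluate_attack(["a hoax", "fine text"], toy_attacker, toy_victim)
print(round(score, 3))
```

Under these assumptions, an attack scores highly only when it both changes the victim's decision and keeps the modified text close to the original; an unchanged decision contributes zero regardless of similarity.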
