Adversarial Evasion Attack Efficiency against Large Language Models
João Vitorino, Eva Maia, Isabel Praça
TL;DR
This paper addresses the vulnerability of Large Language Models (LLMs) used for text classification to adversarial perturbations that can induce misclassification with minimal edits and limited queries. It compares three adversarial evasion methods, BERTAttack (word-level), ChecklistAttack (constrained word-level), and TypoAttack (character-level), across five transformer-based LLMs on the RottenTomatoes sentiment dataset, evaluating them with Misclassification Rate ($MR$), Average Perturbed Words ($APW$), and Average Required Queries ($ARQ$). The findings show that word-level perturbations (BERTAttack) are highly effective, achieving near-100% $MR$ on several models; TypoAttack is also potent on larger models but demands many queries; ChecklistAttack is more query-efficient but less effective because of its constraints. The work highlights the trade-off between attack strength and practicality, suggests defense strategies such as real-time verification of query sequences to mitigate coordinated adversarial probing, and points to future work on broader attack and defense studies across more models and tasks.
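To make the three evaluation metrics concrete, the following is a minimal sketch of how $MR$, $APW$, and $ARQ$ could be computed from per-sample attack logs. It is not the authors' implementation: the `AttackRecord` layout and the `summarize` helper are hypothetical, and the averaging conventions (here, $APW$ over successful attacks only and $ARQ$ over all attempts) are assumptions, since the exact definitions are not given in this summary.

```python
from dataclasses import dataclass


@dataclass
class AttackRecord:
    """Outcome of one attack attempt (hypothetical record layout)."""
    misclassified: bool   # True if the perturbed text changed the model's prediction
    perturbed_words: int  # words modified in the final adversarial text
    queries: int          # model queries spent on this attempt


def summarize(records):
    """Return (MR, APW, ARQ) for a non-empty list of AttackRecord."""
    if not records:
        raise ValueError("no attack records")
    successes = [r for r in records if r.misclassified]
    # MR: fraction of attacked samples that ended up misclassified.
    mr = len(successes) / len(records)
    # Assumption: APW is averaged over successful attacks only.
    apw = (sum(r.perturbed_words for r in successes) / len(successes)
           if successes else 0.0)
    # Assumption: ARQ is averaged over all attempts, failed ones included.
    arq = sum(r.queries for r in records) / len(records)
    return mr, apw, arq


# Toy data, invented purely for illustration (not results from the paper).
records = [
    AttackRecord(True, 2, 40),
    AttackRecord(True, 4, 120),
    AttackRecord(False, 0, 200),
    AttackRecord(True, 3, 60),
]
mr, apw, arq = summarize(records)
print(f"MR={mr:.2f} APW={apw:.2f} ARQ={arq:.1f}")
```

Under these conventions, an attack with a high $MR$ can still be impractical if its $ARQ$ is large, which is the trade-off the summary describes.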
Abstract
Large Language Models (LLMs) are valuable for text classification, but their vulnerabilities must not be disregarded. They lack robustness against adversarial examples, so it is pertinent to understand the impact of different types of perturbations and to assess whether such attacks could be replicated by common users with a small number of perturbations and a small number of queries to a deployed LLM. This work presents an analysis of the effectiveness, efficiency, and practicality of three different types of adversarial attacks against five different LLMs in a sentiment classification task. The results show that word-level and character-level attacks have very distinct impacts. The word attacks were more effective, but the character and more constrained attacks were more practical, requiring fewer perturbations and queries. These differences need to be considered when developing adversarial defense strategies to train more robust LLMs for intelligent text classification applications.
