Camouflage is all you need: Evaluating and Enhancing Language Model Robustness Against Camouflage Adversarial Attacks
Álvaro Huertas-García, Alejandro Martín, Javier Huertas-Tato, David Camacho
TL;DR
This work studies the robustness of Transformer-based NLP models against camouflage adversarial attacks. It introduces a two-phase methodology: a vulnerability assessment of encoder-only, decoder-only, and encoder-decoder models on offensive language and misinformation datasets, followed by resilience enhancement through adversarial training on pre-camouflaged and dynamically camouflaged data. Empirical results show substantial performance drops under camouflage (up to 26% on misinformation detection) and show that adversarial training reduces these drops to roughly 2–7% on average, with dynamic camouflage offering the strongest gains. An open-source camouflaged-dataset generator and external validation with AugLy bolster reproducibility. Effectiveness nonetheless depends on the camouflage type and the data, which underscores the need for broader exploration and additional defense strategies.
Abstract
Adversarial attacks represent a substantial challenge in Natural Language Processing (NLP). This study explores the challenge systematically in two phases: vulnerability evaluation and resilience enhancement of Transformer-based models under adversarial attacks. In the evaluation phase, we assess the susceptibility of three Transformer configurations (encoder-decoder, encoder-only, and decoder-only) to adversarial attacks of escalating complexity across datasets containing offensive language and misinformation. Encoder-only models show performance drops of 14% and 21% in the offensive language detection and misinformation detection tasks, respectively. Decoder-only models register a 16% decrease in both tasks, while encoder-decoder models exhibit maximum performance drops of 14% and 26% in the respective tasks. The resilience-enhancement phase employs adversarial training that integrates pre-camouflaged and dynamically altered data. This approach reduces the performance drop in encoder-only models to an average of 5% in offensive language detection and 2% in misinformation detection. Decoder-only models, which occasionally exceed their original performance, limit the performance drop to 7% and 2% in the respective tasks. Encoder-decoder models do not surpass their original performance, but they reduce the drop to an average of 6% and 2%, respectively. The results suggest a trade-off between performance and robustness, with some models maintaining similar performance while gaining robustness. Our study and adversarial training techniques have been incorporated into an open-source tool for generating camouflaged datasets. However, the effectiveness of the methodology depends on the specific camouflage technique and data encountered, which emphasizes the need for continued exploration.
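To make the distinction between pre-camouflaged and dynamically altered training data concrete, the following is a minimal Python sketch and not the authors' tool. It assumes a simple character-substitution (leetspeak-style) camouflage; the table `LEET_MAP`, the `rate` parameter, and the function names `camouflage`, `pre_camouflaged`, and `dynamic_batches` are all hypothetical, chosen only for illustration.

```python
import random

# Hypothetical look-alike table (assumption): the camouflage rules used in the
# study are not specified in this section.
LEET_MAP = {"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"}


def camouflage(text, rate=0.5, rng=None):
    """Replace each mappable character with its look-alike with probability `rate`."""
    if rng is None:
        rng = random.Random()
    out = []
    for ch in text:
        sub = LEET_MAP.get(ch.lower())
        if sub is not None and rng.random() < rate:
            out.append(sub)
        else:
            out.append(ch)
    return "".join(out)


def pre_camouflaged(texts, rate, seed=0):
    """Static variant: camouflage the dataset once, before training starts."""
    rng = random.Random(seed)
    return [camouflage(t, rate, rng) for t in texts]


def dynamic_batches(texts, rate, epochs, seed=0):
    """Dynamic variant: draw a fresh camouflage of every text at each epoch."""
    rng = random.Random(seed)
    for _ in range(epochs):
        yield [camouflage(t, rate, rng) for t in texts]
```

In this sketch the static variant exposes the model to one fixed perturbation of each example, whereas the dynamic variant re-samples the perturbation every epoch, so the model sees many camouflaged forms of the same text. A higher `rate` stands in for an attack of greater intensity.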
