A Grey-box Text Attack Framework using Explainable AI

Esther Chiramal, Kelvin Soh Boon Kai

TL;DR

This work tackles adversarial vulnerabilities in NLP by proposing a grey-box text attack framework that leverages Explainable AI (XAI) to identify impactful word substitutions without requiring access to target model gradients. It combines LIME-based word-contribution analysis, synonym substitutions, and multiple surrogate Transformer architectures to craft semantically similar adversarial sentences and evaluate transferability across architectures. A transferability criterion is defined as $N_s \geq \lceil N/2 \rceil$, so that an attack is considered effective if it fools at least half of the surrogates, and results show cross-architecture transfer to unseen targets. The framework highlights practical risks of XAI-enabled vulnerabilities and offers a testing paradigm for robustness, with potential enhancements via post-processing techniques such as Unicode variants. Overall, it provides a gradient-free, scalable approach for assessing and improving NLP model robustness against XAI-driven adversarial attacks.
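The transferability criterion can be illustrated with a minimal sketch. It assumes that $N_s$ is the number of surrogate models fooled by an adversarial sentence and $N$ is the total number of surrogates, which the summary implies but does not state explicitly. The function name `is_transferable` and its parameters are hypothetical and not taken from the paper.

```python
import math


def is_transferable(num_fooled: int, num_surrogates: int) -> bool:
    """Check the criterion N_s >= ceil(N / 2).

    num_fooled     -- assumed meaning of N_s: surrogates whose prediction flipped
    num_surrogates -- assumed meaning of N: total number of surrogate models
    """
    if num_surrogates <= 0:
        raise ValueError("num_surrogates must be positive")
    if not 0 <= num_fooled <= num_surrogates:
        raise ValueError("num_fooled must lie between 0 and num_surrogates")
    return num_fooled >= math.ceil(num_surrogates / 2)
```

Under this reading, fooling 2 of 3 surrogates satisfies the criterion, and so does fooling 2 of 4, because the ceiling of 4/2 is 2.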

Abstract

Explainable AI (XAI) is a powerful strategy for explaining complex black-box model predictions in human-interpretable language. It provides the evidence needed to deploy trustworthy and reliable AI systems. However, it also opens the door to locating possible vulnerabilities in an AI model. Traditional adversarial text attacks use word substitution, data augmentation techniques, and gradient-based attacks on powerful pre-trained Bidirectional Encoder Representations from Transformers (BERT) variants to generate adversarial sentences. These attacks are generally white-box in nature and impractical because humans can easily detect them, e.g., changing the word "Poor" to "Rich". We propose a simple yet effective grey-box cum black-box approach that requires no knowledge of the target model and instead uses a set of surrogate Transformer/BERT models to perform the attack using Explainable AI techniques. As Transformers are the current state-of-the-art models for almost all Natural Language Processing (NLP) tasks, an attack generated from BERT1 is transferable to BERT2. This transferability is made possible by the attention mechanism in the Transformer, which allows the model to capture long-range dependencies in a sequence. Using the power of BERT generalisation via attention, we attempt to exploit how Transformers learn by attacking a few surrogate Transformer variants, each based on a different architecture. We demonstrate that this approach is highly effective at generating semantically sound sentences by changing as little as one word, in a way that is not detectable by humans, while still fooling other BERT models.
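The overall idea (rank words by their influence on a set of surrogates, swap the most influential word for a synonym, and keep the swap if enough surrogates are fooled) can be sketched in a few lines. This is only an illustration under strong assumptions, not the authors' implementation: toy keyword classifiers stand in for the surrogate Transformer models, a leave-one-out flip count stands in for LIME word contributions, and the synonym dictionary is hand-written. All names (`make_keyword_classifier`, `word_importance`, `attack`) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Optional, Set

Classifier = Callable[[str], int]  # returns 1 (positive) or 0 (negative)


def make_keyword_classifier(positive: Set[str], negative: Set[str]) -> Classifier:
    """Toy surrogate: positive when positive keywords outnumber negative ones."""
    def classify(text: str) -> int:
        tokens = text.lower().split()
        score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
        return 1 if score > 0 else 0
    return classify


def word_importance(tokens: List[str], surrogates: List[Classifier]) -> List[int]:
    """Leave-one-out stand-in for LIME: count surrogates that flip when a word is removed."""
    base = [clf(" ".join(tokens)) for clf in surrogates]
    scores = []
    for i in range(len(tokens)):
        reduced = " ".join(tokens[:i] + tokens[i + 1:])
        scores.append(sum(clf(reduced) != label for clf, label in zip(surrogates, base)))
    return scores


def attack(sentence: str, surrogates: List[Classifier],
           synonyms: Dict[str, List[str]]) -> Optional[str]:
    """Return a one-word substitution that fools at least ceil(N / 2) surrogates, else None."""
    tokens = sentence.split()
    original = [clf(sentence) for clf in surrogates]
    scores = word_importance(tokens, surrogates)
    threshold = math.ceil(len(surrogates) / 2)
    # Visit words from most to least influential.
    for i in sorted(range(len(tokens)), key=lambda idx: -scores[idx]):
        for candidate in synonyms.get(tokens[i].lower(), []):
            perturbed = " ".join(tokens[:i] + [candidate] + tokens[i + 1:])
            fooled = sum(clf(perturbed) != label for clf, label in zip(surrogates, original))
            if fooled >= threshold:
                return perturbed
    return None


# Hypothetical surrogates and synonym list, for illustration only.
surrogates = [
    make_keyword_classifier({"great", "good", "wonderful"}, {"bad", "awful"}),
    make_keyword_classifier({"great", "good", "fine"}, {"bad", "poor"}),
    make_keyword_classifier({"great", "wonderful", "decent"}, {"awful", "poor"}),
]
synonyms = {"great": ["decent", "fine"]}
adversarial = attack("the movie was great", surrogates, synonyms)
```

In this toy setup the word "great" is the only one whose removal changes any surrogate's label, so it is tried first. A real pipeline would replace the keyword classifiers with trained models and the flip count with LIME word weights.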

Paper Structure

This paper contains 22 sections, 1 equation, 8 figures, 7 tables.

Figures (8)

  • Figure 1: TextAttack framework for adversarial attacks, data augmentation, and model training in NLP
  • Figure 2: Framework architecture from “Grey-box Adversarial Attack and Defence for Sentiment Classification”
  • Figure 3: Toy example of the concept behind LIME (image from “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier” cite:xai)
  • Figure 4: Self-attention compares all input sequence tokens (square blocks) with each other, allowing the model to capture long-range dependencies between tokens.
  • Figure 5: LIME output for a given text along with a list displaying word contributions.
  • ...and 3 more figures