Adversarial Attack for Explanation Robustness of Rationalization Models

Yuankai Zhang, Lingxiao Kong, Haozhao Wang, Ruixuan Li, Jun Wang, Yuhua Li, Wei Liu

TL;DR

UAT2E is an attack that aims to undermine the explainability of rationalization models without altering their predictions, thereby eliciting distrust in these models from human users. Based on the findings, the authors make a series of recommendations for improving the explanations of rationalization models.

Abstract

Rationalization models, which select a subset of the input text as a rationale that is crucial for humans to understand and trust predictions, have recently emerged as a prominent research area in eXplainable Artificial Intelligence. However, most previous studies focus on improving the quality of the rationale and ignore its robustness to malicious attacks. Specifically, it remains unknown whether rationalization models can still generate high-quality rationales under adversarial attack. To explore this, this paper proposes UAT2E, which aims to undermine the explainability of rationalization models without altering their predictions, thereby eliciting distrust in these models from human users. UAT2E uses a gradient-based search to find triggers and then inserts them into the original input to conduct both non-target and target attacks. Experimental results on five datasets reveal the vulnerability of rationalization models in terms of explanation: under attack, they tend to select more meaningless tokens. Based on this, we make a series of recommendations for improving rationalization models in terms of explanation.
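
The abstract does not give the details of the trigger search. As a rough illustration only, the sketch below shows one common first-order way to run a gradient-based search over trigger tokens and then insert the trigger into the input: each candidate token is scored by the dot product between the loss gradient at the trigger position and the resulting change in embedding, and the lowest-scoring tokens are kept as candidates. The function names, the toy embedding table, and the gradient values are hypothetical and are not taken from the paper; UAT2E's actual loss and search procedure may differ.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def top_k_replacements(
    embeddings: Sequence[Sequence[float]],  # toy vocabulary embedding table (assumed)
    current_id: int,                        # token currently at the trigger position
    grad: Sequence[float],                  # gradient of the attack loss w.r.t. that embedding
    k: int,
) -> List[int]:
    """Rank replacement tokens by a first-order estimate of the loss change.

    Swapping the current token for token v changes the loss by roughly
    (E[v] - E[current]) . grad, so the most negative scores are the most
    promising candidates for lowering the attack loss.
    """
    e_cur = embeddings[current_id]
    scores = []
    for tok_id, e in enumerate(embeddings):
        if tok_id == current_id:
            continue
        delta = [x - y for x, y in zip(e, e_cur)]
        scores.append((dot(delta, grad), tok_id))
    scores.sort()  # ascending: largest estimated loss decrease first
    return [tok_id for _, tok_id in scores[:k]]


def insert_trigger(tokens: Sequence[str], trigger: Sequence[str], position: int) -> List[str]:
    """Return a new token list with the trigger inserted at the given position."""
    return list(tokens[:position]) + list(trigger) + list(tokens[position:])


# Toy usage with made-up numbers.
toy_embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
candidates = top_k_replacements(toy_embeddings, current_id=0, grad=[1.0, 0.5], k=2)
attacked = insert_trigger(["a", "b", "c"], ["x", "y"], position=1)
```

In a full attack loop, each candidate would be re-scored with a real forward pass, since the first-order estimate is only an approximation of the true loss change.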

Paper Structure (22 sections, 18 equations, 8 figures, 10 tables, 1 algorithm)

This paper contains 22 sections, 18 equations, 8 figures, 10 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Illustration of ML models with clean input and crafted input separately. (a) ML models not only return the correct prediction but also provide a comprehensible explanation to the human user. (b) For the crafted input, the explanation provided by ML models is incomprehensible.
  • Figure 2: (a) An example from a beer review sentiment classification dataset with correct prediction and rationale. (b) Inserting "the tea looks horrible ." causes the rationalizer to select "tea", "smell", and "grain", leading to an incorrect prediction. (c) Inserting "yet coincidentally first as given" preserves the correct prediction but yields an obviously incorrect rationale. Underlining, red, and yellow mark the human-annotated rationale, the triggers, and the selected rationales, respectively.
  • Figure 3: Examples of label sequences under non-target and target attacks. Attack triggers are highlighted in red. Grey indicates 0, and pink indicates 1.
  • Figure 4: Comparison across different settings. We compare three settings: (a) different models, (b) using BERT or GRU as an encoder, and (c) unsupervised training and supervised training with human-annotated rationales. The comparison is conducted using five models and five datasets.
  • Figure 5: Evaluating the impact of improving prediction robustness on explanation robustness. We train RNP on the Movie (a) and MultiRC (b) datasets. "w/o adv" and "w/ adv" denote training without and with adversarial training, respectively.
  • ...and 3 more figures