LimeAttack: Local Explainable Method for Textual Hard-Label Adversarial Attack

Hai Zhu, Zhaoqing Yang, Weiwei Shang, Yuren Wu

TL;DR

This work tackles the realistic hard-label adversarial setting in NLP, where the attacker sees only the predicted label. It introduces LimeAttack, which uses LIME-like local explanations to estimate word importance and beam search to craft adversarial text under a limited query budget. By bridging score-based and hard-label attacks, LimeAttack outperforms existing hard-label baselines in attack success and perturbation quality across multiple datasets and models, including large language models. The adversarial examples transfer across models and improve robustness when used in adversarial training, and the results confirm that such attacks remain a threat to modern NLP systems. Human evaluation corroborates the quality and fluency of the adversarial examples, with practical implications for defense and robust model design.

Abstract

Natural language processing models are vulnerable to adversarial examples. Previous textual adversarial attacks use gradients or confidence scores to compute a word importance ranking and generate adversarial examples. However, this information is unavailable in the real world. Therefore, we focus on a more realistic and challenging setting, named hard-label attack, in which the attacker can only query the model and obtain a discrete prediction label. Existing hard-label attack algorithms tend to initialize adversarial examples by random substitution and then use complex heuristic algorithms to optimize the adversarial perturbation. These methods require many model queries, and their attack success rate is limited by the adversary initialization. In this paper, we propose a novel hard-label attack algorithm named LimeAttack, which leverages a local explainable method to approximate the word importance ranking and then adopts beam search to find the optimal solution. Extensive experiments show that LimeAttack achieves better attack performance than existing hard-label attacks under the same query budget. In addition, we evaluate the effectiveness of LimeAttack on large language models, and the results indicate that adversarial examples remain a significant threat to large language models. The adversarial examples crafted by LimeAttack are highly transferable and effectively improve model robustness in adversarial training.
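
The word-importance idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `word_importance`, the uniform random masking, the ridge-regularized least-squares fit, and the keyword classifier `toy_predict` are all assumptions made for illustration. The only signal used is the hard label returned by the model.

```python
import numpy as np

def word_importance(words, predict_label, n_samples=200, mask_prob=0.3,
                    ridge=1e-3, seed=0):
    """Approximate per-word importance from hard labels only (illustrative)."""
    rng = np.random.default_rng(seed)
    original_label = predict_label(words)
    n = len(words)

    # Binary masks: 1.0 keeps a word, 0.0 drops it (assumed masking scheme).
    masks = (rng.random((n_samples, n)) > mask_prob).astype(float)
    masks[0] = 1.0  # always include the unperturbed sentence

    # Target is 1.0 when the model still returns the original label.
    y = np.array([
        float(predict_label([w for w, keep in zip(words, m) if keep])
              == original_label)
        for m in masks
    ])

    # Ridge-regularized linear fit with an intercept column.
    X = np.hstack([masks, np.ones((n_samples, 1))])
    coef = np.linalg.solve(X.T @ X + ridge * np.eye(n + 1), X.T @ y)
    return coef[:n]  # one weight per word; larger means more important


# Hypothetical stand-in for the victim model: it exposes labels only.
def toy_predict(tokens):
    return 1 if "great" in tokens else 0

words = ["the", "movie", "was", "great", "overall"]
scores = word_importance(words, toy_predict)
ranking = np.argsort(-scores)  # word positions, most important first
print([words[i] for i in ranking])
```

Each call to `predict_label` counts as one model query, so `n_samples` is the query cost of the ranking step in this sketch.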

Paper Structure (47 sections, 9 equations, 6 figures, 28 tables, 1 algorithm)

This paper contains 47 sections, 9 equations, 6 figures, 28 tables, 1 algorithm.

Figures (6)

  • Figure 1: Search paths of existing hard-label attacks and LimeAttack.
  • Figure 2: Overview of LimeAttack. It consists of two modules: word importance ranking and perturbation execution. We first generate neighborhood examples by masking some words in the benign sample, and then fit a linear model to approximate the importance of each word $x_i$. Next, we select a candidate set for each word in the counter-fitted embedding space. Finally, we adopt beam search (beam size $b=2$ in the figure) to generate adversarial examples iteratively.
  • Figure 3: Attack success rate of different attacks under different query budgets on CNN-MR.
  • Figure 4: The attack success rate (%) $\uparrow$, perturbation rate (%) $\downarrow$, and semantic similarity (%) $\uparrow$ of LimeAttack on BERT using the MR and SST-2 datasets under different beam sizes $b$.
  • Figure 5: Transferability of adversarial examples on MR dataset for BERT. Lower accuracy indicates higher transferability.
  • ...and 1 more figure
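
The perturbation step described in the Figure 2 caption (candidate substitution followed by beam search with beam size $b$) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm: `beam_search_attack`, the `candidates` dictionary, and especially `score_fn` are hypothetical. The text here does not say how beams are ranked when only labels are available, so `score_fn` is a stand-in supplied by the caller.

```python
def beam_search_attack(words, ranked_positions, candidates, predict_label,
                       score_fn, beam_size=2):
    """Substitute words in importance order, keeping the best `beam_size`
    partial candidates, until the predicted label changes (illustrative)."""
    original_label = predict_label(words)
    beams = [list(words)]
    for pos in ranked_positions:
        expanded = []
        for beam in beams:
            for sub in candidates.get(words[pos], []):
                trial = beam[:pos] + [sub] + beam[pos + 1:]
                if predict_label(trial) != original_label:
                    return trial  # label flipped: adversarial example found
                expanded.append(trial)
        # Rank new and old beams together; score_fn is an assumed heuristic.
        beams = sorted(expanded + beams, key=score_fn, reverse=True)[:beam_size]
    return None  # no label flip within the given positions


# Hypothetical label-only victim: positive if any known keyword appears.
POSITIVE = {"great", "fun"}

def toy_predict(tokens):
    return 1 if any(t in POSITIVE for t in tokens) else 0

# Assumed beam score: fewer remaining keywords is closer to a flip.
def toy_score(tokens):
    return -sum(t in POSITIVE for t in tokens)

words = ["a", "great", "and", "fun", "movie"]
synonyms = {"great": ["fine", "grand"], "fun": ["amusing", "entertaining"]}
adversarial = beam_search_attack(words, [1, 3], synonyms, toy_predict,
                                 toy_score, beam_size=2)
print(adversarial)
```

In this toy run, no single substitution flips the label, so the search has to carry partial candidates from the first position to the second, which is the role the beam plays.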