Revisiting Character-level Adversarial Attacks for Language Models

Elias Abad Rocamora; Yongtao Wu; Fanghui Liu; Grigorios G. Chrysos; Volkan Cevher

TL;DR

This work introduces Charmer, an efficient query-based adversarial attack that achieves a high attack success rate (ASR) while generating adversarial examples that remain highly similar to the original inputs. The attack succeeds against both small (BERT) and large (Llama 2) models.

Abstract

Adversarial attacks in Natural Language Processing apply perturbations at the character or token level. Token-level attacks, which have gained prominence for their use of gradient-based methods, are prone to altering sentence semantics, leading to invalid adversarial examples. While character-level attacks easily maintain semantics, they have received less attention because they cannot easily adopt popular gradient-based methods and are thought to be easy to defend against. Challenging these beliefs, we introduce Charmer, an efficient query-based adversarial attack capable of achieving a high attack success rate (ASR) while generating highly similar adversarial examples. Our method successfully targets both small (BERT) and large (Llama 2) models. Specifically, on BERT with SST-2, Charmer improves the ASR by 4.84 percentage points and the USE similarity by 8 percentage points with respect to the previous state of the art. Our implementation is available at https://github.com/LIONS-EPFL/Charmer.

Paper Structure (32 sections, 2 theorems, 26 equations, 6 figures, 25 tables, 3 algorithms)

Introduction
Related Work
Problem Formulation
The Sentence Space
Adversarial Robustness
Characterizing the Perturbations
Method
Pre-selection of Replacement Locations
Attack Classifier
Attack LLM
Experiments
Selecting the Number of Positions
Comparison Against State-of-the-art Attacks
Adversarial Training
Bypassing Typo-correctors
...and 17 more sections

Key Result

Proposition 3.6

Let $S \in \mathcal{S}(\Gamma)$ be a non-empty sentence, and $S'$ be another sentence satisfying $d_{\text{lev}}(S,S')=1$. Then we can find $i \in [2|S| + 1]$ and a character $c \in \Gamma\cup \{\xi\}$ such that

Figures (6)

Figure 1: Schematic of the proposed method, Charmer: Example of our attack in the sentiment classification task with a position subset of size $n=3$. At each iteration, our attack computes the most important positions in the sentence via the position selection algorithm. Then, we generate all possible sentences obtained by replacing a character in one of the top positions and keep the one with the highest loss. If this sentence is misclassified, the process finishes.
Figure 2: Selection of the number of candidate positions: Attack Success Rate (ASR) at $k=1$ (left axis) and runtime (right axis) for our candidate position selection strategy (the position selection algorithm, bold lines) and a random selection (Random, dotted lines). Our strategy improves on the random baseline at a small cost ($\approx 0.25$ s).
Figure 3: Adversarial Training evolution: When employing Charmer as a defense, clean and character-level accuracies grow consistently through training steps, while token-level (TextFooler) accuracy does not improve. The TextGrad defense consistently improves the token-level accuracy at the cost of hindering clean and character-level accuracy, which grow during the first $\approx 400$ steps and then start decreasing.
Figure S4: Top 20 most common replacements with Charmer: The pair of characters $(c_1,c_2)$ indicates that $c_1$ is replaced by $c_2$ in the sentence. If $c_1 = \xi$, the replacement represents an insertion, and if $c_2 = \xi$, the operation represents a deletion. The special character is denoted by $\xi$, which is unambiguous here because the Greek character $\xi$ itself did not appear in the most common operations. The most common operations are insertions of punctuation and special characters.
Figure S5: Distribution of the relative location of perturbations in the sentence with Charmer: $0$ and $1$ represent an insertion before the first character and after the last character in the sentence, respectively. We do not observe any tendency in the QNLI, RTE and SST-2 datasets. For AG-News and QNLI, the perturbations in locations closer to $0$ appear to be more common.
...and 1 more figure
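
The loop described in the Figure 1 caption can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the function names, the toy loss, and the way positions are ranked (by the loss obtained when a space is written at each position) are all assumptions made for the example. The paper's position selection algorithm may score positions differently, and a real attack would query a model such as BERT instead of `toy_loss`.

```python
import string
from typing import Callable, List

XI = "\u03be"  # stand-in for the special character xi; assumed not to be in the alphabet


def expand(s: str) -> List[str]:
    """Interleave XI around every character, giving 2*len(s)+1 slots (assumed form)."""
    out = [XI]
    for ch in s:
        out += [ch, XI]
    return out


def contract(slots: List[str]) -> str:
    """Drop every XI slot to recover a plain sentence."""
    return "".join(c for c in slots if c != XI)


def replace_at(s: str, i: int, c: str) -> str:
    """Write c into slot i: a substitution, an insertion, or (c == XI) a deletion."""
    slots = expand(s)
    slots[i] = c
    return contract(slots)


def top_positions(s: str, loss: Callable[[str], float], n: int, probe: str = " ") -> List[int]:
    """Hypothetical ranking: keep the n slots where writing `probe` raises the loss most."""
    scored = sorted(range(2 * len(s) + 1), key=lambda i: -loss(replace_at(s, i, probe)))
    return scored[:n]


def greedy_attack(s, loss, is_misclassified, alphabet, n=3, k=2):
    """Apply at most k single-character edits, greedily maximising the loss."""
    for _ in range(k):
        positions = top_positions(s, loss, n)
        candidates = [replace_at(s, i, c) for i in positions for c in list(alphabet) + [XI]]
        s = max(candidates, key=loss)  # candidate with the highest loss
        if is_misclassified(s):
            break
    return s


def toy_loss(s: str) -> float:
    """Toy stand-in for a classifier loss: low while the word 'good' survives."""
    return 0.0 if "good" in s else 1.0


adv = greedy_attack("a good film", toy_loss, lambda t: toy_loss(t) > 0.5,
                    string.ascii_lowercase, n=3, k=1)
print(adv)
```

Restricting the candidate set to `n` positions keeps the number of queries per iteration at roughly `n` times the alphabet size (plus one pass over all slots to rank them), which is the trade-off between ASR and runtime that Figure 2 examines.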

Theorems & Definitions (12)

Example 3.1: $d_{\text{lev}}$ from $S = \text{Hello}$ to several modifications.
Definition 3.2: $k$-robustness at $S$
Definition 3.3: Expansion and contraction operators
Example 3.4
Definition 3.5: Replacement operator
Proposition 3.6: Characterization of $d_{\text{lev}}$-$1$ operations
Remark 3.7: Non-uniqueness
Remark 3.8: Intuition
Corollary 3.9: Generating $\mathcal{S}_k$
Remark 3.10
...and 2 more
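
To make the role of the index range $[2|S|+1]$ and the special character $\xi$ more concrete, the following sketch enumerates the sentences reachable with one character operation and checks them against a standard dynamic-programming Levenshtein distance. It assumes that the expansion operator interleaves $\xi$ around every character and that the contraction operator removes every $\xi$; these forms, the function names, and the test strings are illustrative assumptions and are not taken from the paper's definitions.

```python
import string

XI = "\u03be"  # stand-in for the special character xi; assumed not to be in the alphabet


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def expand(s: str) -> list:
    """Assumed expansion: XI before, between, and after the characters of s."""
    out = [XI]
    for ch in s:
        out += [ch, XI]
    return out


def contract(slots: list) -> str:
    """Assumed contraction: remove every XI slot."""
    return "".join(c for c in slots if c != XI)


def neighbours(s: str, alphabet: str) -> set:
    """All sentences obtained by writing one character (or XI) into one slot."""
    out = set()
    for i in range(2 * len(s) + 1):
        for c in list(alphabet) + [XI]:
            slots = expand(s)
            slots[i] = c
            out.add(contract(slots))  # different (i, c) pairs can give the same sentence
    return out


nb = neighbours("Hello", string.ascii_letters + "!")
print(all(levenshtein("Hello", t) <= 1 for t in nb))
print({"Helo", "Hallo", "Hello!"} <= nb)
```

Writing a character into a $\xi$ slot acts as an insertion, writing $\xi$ over a character acts as a deletion, and writing a character over a character acts as a substitution, which matches the $(c_1, c_2)$ convention used in the Figure S4 caption.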
