Unveiling Vulnerability of Self-Attention

Khai Jiet Liong; Hongqiu Wu; Hai Zhao

TL;DR

This work reveals a vulnerability in transformer-based pre-trained language models (PLMs) by perturbing the self-attention (SA) mechanism rather than the input text. It introduces HackAttend, a greedy, gradient-guided method that crafts minimal attention-mask perturbations ($\alpha$ as low as $1\%$) to induce misclassifications with a high attack success rate (ASR, up to $98\%$ on several tasks). To counter this, it proposes S-Attend, a lightweight smoothing technique that randomizes attention during training and achieves robustness comparable to adversarial training against multiple attackers. Together, HackAttend and S-Attend expose a structural fragility in SA and offer a practical defense that improves robustness with modest impact on clean accuracy.

Abstract

Pre-trained language models (PLMs) have been shown to be vulnerable to minor word changes, which poses a serious threat to real-world systems. Previous studies focus directly on manipulating word inputs, so they are limited by how they generate adversarial samples and generalize poorly to the variety of real-world attacks. This paper studies the basic structure of transformer-based PLMs, the self-attention (SA) mechanism. (1) We propose a powerful perturbation technique, \textit{HackAttend}, which perturbs the attention scores within the SA matrices via meticulously crafted attention masks. We show that state-of-the-art PLMs are highly vulnerable: minor attention perturbations $(1\%)$ can produce a very high attack success rate $(98\%)$. Our paper extends the conventional text attack based on word perturbations to more general structural perturbations. (2) We introduce \textit{S-Attend}, a novel smoothing technique that effectively makes SA robust via structural perturbations. We empirically demonstrate that this simple yet effective technique achieves robust performance on par with adversarial training when facing various text attackers. Code is publicly available at \url{github.com/liongkj/HackAttend}.
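To make the idea of an attention-mask perturbation concrete, the following is a minimal NumPy sketch, not the authors' implementation. It masks a small fraction of SA units in a toy score matrix before the softmax and compares the result with the clean attention. The helper names `masked_attention` and `random_unit_mask`, the toy sizes, and the rule of never masking the diagonal are assumptions made for illustration. In particular, units are chosen at random here, whereas HackAttend selects them with a greedy, gradient-guided search.

```python
import numpy as np


def masked_attention(scores, mask):
    """Row-wise softmax over scores; units with mask == 0 get zero weight."""
    masked = np.where(mask.astype(bool), scores, -np.inf)
    # Subtract the row maximum for numerical stability.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def random_unit_mask(n, alpha, rng):
    """Hypothetical helper: switch off about alpha * n * n off-diagonal units.

    The diagonal is kept so that every row retains at least one finite
    score (an assumption of this sketch, not a constraint from the paper).
    """
    mask = np.ones((n, n), dtype=int)
    off_diag = np.argwhere(~np.eye(n, dtype=bool))
    k = int(round(alpha * n * n))
    chosen = off_diag[rng.choice(len(off_diag), size=k, replace=False)]
    mask[chosen[:, 0], chosen[:, 1]] = 0
    return mask


rng = np.random.default_rng(0)
n = 10                                   # toy sequence length
scores = rng.normal(size=(n, n))         # stand-in for one head's scaled QK^T

clean = masked_attention(scores, np.ones((n, n), dtype=int))
mask = random_unit_mask(n, alpha=0.05, rng=rng)
perturbed = masked_attention(scores, mask)

# Largest change in any attention weight caused by masking 5% of the units.
print(np.abs(perturbed - clean).max())
```

Applying a freshly sampled random mask of this kind at every training step is one loose reading of "randomizes attention during training"; the exact sampling scheme used by S-Attend is not specified in this summary.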

Paper Structure (36 sections, 2 equations, 5 figures, 9 tables, 1 algorithm)


Introduction
Related Work
Introduction to HackAttend
Notations
Overview of HackAttend
Layers and Head Selection
Layers Selection
Heads Selection
SA Units Selection
Constraints
Algorithm
Empirical Experiment
Evaluation Settings
Hamming Distance
Clean Accuracy
...and 21 more sections

Figures (5)

Figure 1: Illustration of the process of HackAttend manipulating the SA mechanism. The perturbation of SA units (highlighted in orange) demonstrates how the algorithm induces misclassification in sentiment analysis by flipping the activation states. The example perturbation shows the transition from a positive to a negative sentiment interpretation, providing a visual representation of HackAttend's effect on the model's decision-making process for the SST-2 dataset.
Figure 2: Placement of mask hook
Figure 3: Normalized count of successful perturbations, grouped by chosen layer and task. -1 indicates a failed perturbation.
Figure 4: Comparison of attention maps on the 2nd layer, 10th head of BERT-base before and after perturbation. The sample is "A wildly inconsistent emotional experience." from SST-2. The sentence is classified as positive after the perturbation.
Figure 5: Comparison of attention maps for the text "a solid examination of the male midlife crisis." from SST-2 on the 10th layer, 4th head of BERT-base, between the normal model (on the clean and BERT-Attack augmented datasets) and the S-Attend model (on the augmented dataset). After the adversarial attack, the normal model misclassifies the sample as negative, while the S-Attend model still classifies it correctly as positive.
