BAE: BERT-based Adversarial Examples for Text Classification

Siddhant Garg, Goutham Ramakrishnan

TL;DR

This paper introduces BAE, a black-box adversarial attack for text classification that leverages a BERT masked language model to contextually replace or insert tokens, producing more natural and coherent perturbations than prior synonym-based methods. By estimating token importance and applying four perturbation modes (R, I, R/I, R+I) with careful filtering by POS and semantic similarity, BAE achieves strong misclassification performance across multiple datasets and models. Automatic and human evaluations show that BAE attacks yield larger accuracy drops while maintaining higher grammaticality and semantic coherence than baselines like TextFooler. The work demonstrates the effectiveness of contextual perturbations for NLP security and provides insights into attack strength versus naturalness, with practical implications for robustness testing and defense design.
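
The token-importance step mentioned above can be illustrated with a small sketch. It assumes a deletion-based score: a token's importance is the drop in the classifier's probability for the true label when that token is removed. The function names and the toy classifier below are hypothetical stand-ins and are not taken from the paper's code; a real black-box attack would query the target model instead.

```python
from typing import Callable, List, Sequence


def token_importance(
    tokens: Sequence[str],
    predict_proba: Callable[[Sequence[str]], List[float]],
    true_label: int,
) -> List[float]:
    """Score each token by the drop in true-label probability when it is deleted."""
    base = predict_proba(tokens)[true_label]
    scores = []
    for i in range(len(tokens)):
        reduced = list(tokens[:i]) + list(tokens[i + 1:])
        scores.append(base - predict_proba(reduced)[true_label])
    return scores


def toy_predict_proba(tokens: Sequence[str]) -> List[float]:
    """Hypothetical classifier: returns [P(negative), P(positive)] from keyword counts."""
    pos = sum(t == "great" for t in tokens)
    neg = sum(t == "boring" for t in tokens)
    p_pos = (1 + pos) / (2 + pos + neg)
    return [1 - p_pos, p_pos]


tokens = "the film was great".split()
scores = token_importance(tokens, toy_predict_proba, true_label=1)

# Visit tokens from most to least important when choosing where to perturb.
order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
print([tokens[i] for i in order])
```

Only output probabilities are queried here, which is what keeps the setting black-box: no gradients or model internals are needed.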

Abstract

Modern text classification models are susceptible to adversarial examples, perturbed versions of the original text indiscernible by humans which get misclassified by the model. Recent works in NLP use rule-based synonym replacement strategies to generate adversarial examples. These strategies can lead to out-of-context and unnaturally complex token replacements, which are easily identifiable by humans. We present BAE, a black box attack for generating adversarial examples using contextual perturbations from a BERT masked language model. BAE replaces and inserts tokens in the original text by masking a portion of the text and leveraging the BERT-MLM to generate alternatives for the masked tokens. Through automatic and human evaluations, we show that BAE performs a stronger attack, in addition to generating adversarial examples with improved grammaticality and semantic coherence as compared to prior work.
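
As a rough illustration of the masking step, the sketch below builds the masked inputs for a single token position: one where the mask replaces the token, and two where the mask is inserted to its left or right. The function name and dictionary keys are hypothetical; in a full attack, each masked sequence would be passed to a BERT masked language model to obtain candidate tokens, which is not shown here.

```python
from typing import Dict, List


def masked_variants(tokens: List[str], i: int, mask: str = "[MASK]") -> Dict[str, List[str]]:
    """Build the masked inputs for position i: replace, insert-left, insert-right."""
    if not 0 <= i < len(tokens):
        raise IndexError("token position out of range")
    return {
        # Replace: the mask takes the place of token i.
        "replace": tokens[:i] + [mask] + tokens[i + 1:],
        # Insert: token i is kept and the mask is placed next to it.
        "insert_left": tokens[:i] + [mask] + tokens[i:],
        "insert_right": tokens[:i + 1] + [mask] + tokens[i + 1:],
    }


tokens = ["the", "food", "was", "good"]
for mode, seq in masked_variants(tokens, 3).items():
    print(mode, " ".join(seq))
```

Replacement keeps the sequence length unchanged, while insertion lengthens it by one token, so the two operations perturb the input in different ways.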

Paper Structure

This paper contains 9 sections, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: We use BERT-MLM to predict masked tokens in the text for generating adversarial examples. The MASK token replaces a word (BAE-R attack) or is inserted to the left/right of the word (BAE-I).
  • Figure 2: Automatic evaluation of adversarial attacks on MPQA, Subj and TREC datasets. Other details follow those from Table \ref{tab:results1}. All 4 modes of BAE attacks almost always outperform TextFooler.
  • Figure 3: Graphs comparing attack effectiveness on the TREC dataset, as a function of maximum % perturbation to the input.
  • Figure 4: Amazon
  • Figure 5: Yelp
  • ...and 4 more figures