SemRoDe: Macro Adversarial Training to Learn Representations That are Robust to Word-Level Attacks

Brian Formento, Wenjie Feng, Chuan Sheng Foo, Luu Anh Tuan, See-Kiong Ng

TL;DR

This work proposes a novel approach called Semantic Robust Defence (SemRoDe), a Macro Adversarial Training strategy that enhances the robustness of LMs and aims to generalize across word embeddings, even when they share minimal overlap at both vocabulary and word-substitution levels.

Abstract

Language models (LMs) are indispensable tools for natural language processing tasks, but their vulnerability to adversarial attacks remains a concern. While current research has explored adversarial training techniques, their improvements in defending against word-level attacks have been limited. In this work, we propose a novel approach called Semantic Robust Defence (SemRoDe), a Macro Adversarial Training strategy to enhance the robustness of LMs. Drawing inspiration from recent studies in the image domain, we investigate and later confirm that in a discrete data setting such as language, adversarial samples generated via word substitutions do indeed belong to an adversarial domain exhibiting a high Wasserstein distance from the base domain. Our method learns a robust representation that bridges these two domains. We hypothesize that if samples were not projected into an adversarial domain, but instead into a domain with minimal shift, attack robustness would improve. We align the domains by incorporating a new distance-based objective. With this, our model is able to learn more generalized representations by aligning the model's high-level output features, and can therefore better handle unseen adversarial samples. This method can be generalized across word embeddings, even when they share minimal overlap at both vocabulary and word-substitution levels. To evaluate the effectiveness of our approach, we conduct experiments on BERT and RoBERTa models on three datasets. The results demonstrate promising state-of-the-art robustness.
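
The distance-based objective described in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes the distance is a maximum mean discrepancy (MMD) with a Gaussian kernel, computed between high-level features of base and adversarial samples, and it assumes the regularizer is simply added to a task loss with a weight `lam`. The function names, the `bandwidth` and `lam` parameters, and the toy feature values are all hypothetical.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(base_feats, adv_feats, bandwidth=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""

    def mean_kernel(xs, ys):
        total = sum(rbf_kernel(x, y, bandwidth) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(base_feats, base_feats)
        + mean_kernel(adv_feats, adv_feats)
        - 2.0 * mean_kernel(base_feats, adv_feats)
    )


def total_loss(task_loss, base_feats, adv_feats, lam=1.0):
    """Task loss plus a weighted distance regularizer (the weighting is an assumption)."""
    return task_loss + lam * mmd_squared(base_feats, adv_feats)


# Toy, made-up 2-D features standing in for the model's high-level outputs.
base = [[0.0, 0.0], [0.1, -0.1]]
adv_far = [[2.0, 2.0], [2.1, 1.9]]      # adversarial features far from the base domain
adv_near = [[0.05, 0.0], [0.1, -0.05]]  # adversarial features after hypothetical alignment

print(mmd_squared(base, adv_far))   # large distance: domains are shifted
print(mmd_squared(base, adv_near))  # small distance: domains are nearly aligned
```

Minimizing such a term during training pushes the adversarial features toward the base features, which is the sense in which the two domains are aligned.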

Paper Structure

This paper contains 61 sections, 12 equations, 12 figures, 17 tables, 1 algorithm.

Figures (12)

  • Figure 1: The statistical components are aligned in both the base and adversarial domain through the regularizer $\mathcal{L}_{Dist}$. Over time, this alignment allows the model $f$ to project both base and adversarial samples to a robust domain, thus enhancing the robust generalization to adversarial samples.
  • Figure 2: The distributions of the MR training dataset with t-SNE projection in a binary classification task. Heavy overlap (Top) of the augmented two-class data leads to mixtures of marginal distributions. After aligning the original and augmented distributions (Bottom), the overlap is alleviated and the classes become nearly linearly separable.
  • Figure 3: OT (Top) and MMD (Bottom) response over iterations for MR.
  • Figure 4: Performing a word substitution is equivalent to adding a large $\delta$ of fixed size to each word pair. In adversarial training, $\delta$ is normally set to 0.5, which is a small perturbation in comparison.
  • Figure 5: Non-robust model t-SNE.
  • ...and 7 more figures
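
The OT quantity tracked in Figure 3, and the Wasserstein distance mentioned in the abstract, can be made concrete in a simplified setting. The sketch below is an illustration only, not the paper's computation: it assumes one-dimensional samples of equal size with uniform weights, in which case the Wasserstein-1 distance reduces to the mean absolute difference between the sorted samples. The model features in the paper are high-dimensional, so an actual implementation would need a general OT solver. The function name `wasserstein_1d` and the sample values are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical samples.

    With uniform weights and equal sample counts, the optimal transport plan
    matches the i-th smallest value of xs to the i-th smallest value of ys.
    """
    if len(xs) != len(ys):
        raise ValueError("this simplified sketch requires equal sample sizes")
    if not xs:
        raise ValueError("samples must be non-empty")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Made-up scalar projections of base and adversarial features.
base_proj = [0.0, 1.0, 2.0]
adv_proj = [3.0, 5.0, 4.0]

print(wasserstein_1d(base_proj, base_proj))  # identical samples: no shift
print(wasserstein_1d(base_proj, adv_proj))   # every point moves by the same offset
```

A distance that shrinks over training iterations, as in the curves the caption describes, would indicate that the adversarial features are moving toward the base features.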