A Curious Case of Searching for the Correlation between Training Data and Adversarial Robustness of Transformer Textual Models

Cuong Dang; Dung D. Le; Thai Le

TL;DR

The paper links the adversarial robustness of transformer-based NLP classifiers to the properties of their fine-tuning data. It introduces a data-first framework that extracts 13 dataset-level features and trains a lightweight regression model (notably Random Forest) to predict adversarial robustness, measured as attack success rate (ASR). The framework cuts evaluation runtime by 30x-193x compared to the traditional technique, transfers across models, and is robust to statistical randomness. Embedding-distribution metrics and token statistics are strong predictors of ASR, and features such as CHI, FR, the number of tokens, and the number of classes correlate clearly with robustness. The approach offers a fast, interpretable tool for robustness assessment and data-centric improvement, with practical implications for adversarial training, model selection, and data debugging. The authors acknowledge limitations and point to extending the framework to broader architectures and scenarios.

Abstract

Existing works have shown that fine-tuned textual transformer models achieve state-of-the-art prediction performance but are also vulnerable to adversarial text perturbations. Traditional adversarial evaluation is often done only after fine-tuning the models and ignores the training data. In this paper, we aim to show that there is also a strong correlation between training data and model robustness. To this end, we extract 13 different features representing a wide range of input fine-tuning corpora properties and use them to predict the adversarial robustness of the fine-tuned models. Focusing mostly on the encoder-only transformer models BERT and RoBERTa, with additional results for BART, ELECTRA, and GPT2, we provide diverse evidence to support our argument. First, empirical analyses show that (a) extracted features can be used with a lightweight classifier such as Random Forest to predict the attack success rate effectively, and (b) features with the most influence on the model robustness have a clear correlation with the robustness. Second, our framework can be used as a fast and effective additional tool for robustness evaluation since it (a) saves 30x-193x runtime compared to the traditional technique, (b) is transferable across models, (c) can be used under adversarial training, and (d) is robust to statistical randomness. Our code is publicly available at https://github.com/CaptainCuong/RobustText_ACL2024.
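
The core idea in the abstract, predicting attack success rate from dataset-level features with a Random Forest, can be illustrated with a minimal sketch. This is not the authors' implementation: the feature matrix and the ASR labels below are synthetic, the sizes (200 datasets, a 75/25 split, 200 trees) are arbitrary assumptions, and only the count of 13 features comes from the paper. It uses scikit-learn's `RandomForestRegressor`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Assumption: each row describes ONE fine-tuning dataset by 13 numeric features.
# Real features would be computed from the corpus; here they are random.
N_DATASETS, N_FEATURES = 200, 13
X = rng.normal(size=(N_DATASETS, N_FEATURES))

# Assumption: synthetic ASR labels in (0, 1) that depend mostly on two features.
# In the real setting, each label would come from attacking a fine-tuned model.
raw = 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=N_DATASETS)
y = 1.0 / (1.0 + np.exp(-raw))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Fit a lightweight regressor that maps dataset features to ASR.
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Predict ASR for held-out datasets and measure the mean absolute error.
pred = model.predict(X_test)
mae = float(np.mean(np.abs(pred - y_test)))

# Impurity-based importances indicate which features drive the prediction.
importances = model.feature_importances_
top_features = np.argsort(importances)[::-1][:3]

print(f"MAE on held-out datasets: {mae:.3f}")
print("Indices of the three most important features:", top_features.tolist())
```

Once such a regressor is trained, scoring a new dataset only requires computing its features and calling `predict`, which is where the runtime saving over fine-tuning plus attack generation would come from.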

Paper Structure (25 sections, 7 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Problem Formulation
Method
Phase 1: Data Preparation
Phase 2: Feature Engineering
Phase 3: Extract Adversarial Robustness as Regression Labels
Phase 4: Regression Analysis through Adversarial Robustness Estimation
Phase 5: Evaluation and Analysis
Related Work
Experiment Setup
Results, Analyses, and Discussions
Another Tool for Robustness Analysis
Further Discussion
Confounding Factors
Contextual Features
...and 10 more sections

Figures (8)

Figure 1: A novel attempt to bypass both model fine-tuning and adversarial generation and estimate adversarial robustness directly from the training dataset, potentially reducing runtime by 30x–193x.
Figure 2: An illustrative overview of our framework for data-first adversarial robustness analysis. Black and blue arrows take one and two previous input(s), respectively, and return an output.
Figure 3: Taxonomy of 13 predictive features (gray) categorized into groups (red) and sub-groups (green).
Figure 4: Embeddings of two fine-tuning datasets projected onto a 2D space by t-SNE (van der Maaten and Hinton, 2008). The dataset with more separated clusters (right) results in a fine-tuned model that is more vulnerable to adversarial perturbations.
Figure 5: Importance of the most important features of the best Random Forest regression model in predicting ASRs of BERT and RoBERTa in the interpolation and extrapolation settings.
...and 3 more figures
