
Contextual Breach: Assessing the Robustness of Transformer-based QA Models

Asir Saadat, Nahian Ibn Asad

TL;DR

This work addresses the robustness of transformer-based contextual QA models to adversarial perturbations applied to input contexts. It presents a SQuAD-derived benchmark with 30,000 QA pairs perturbed by 7 noise types across 5 intensities and introduces three metrics to quantify robustness. Experiments across five models (BERT, DeBERTa, ELECTRA, DistilBERT, RoBERTa) show DeBERTa and DistilBERT are typically more robust, while BERT is more vulnerable; perturbations like Character Deletion and Word Reordering are especially damaging, highlighting the importance of semantic understanding. The proposed framework enables systematic robustness analysis and can guide training and evaluation strategies to improve performance in realistic, noisy contexts.

Abstract

Contextual question-answering models are susceptible to adversarial perturbations of the input context, which are commonly observed in real-world scenarios. Such adversarial noise is designed to degrade model performance by distorting the textual input. We introduce a dataset that incorporates seven distinct types of adversarial noise into the context, each applied at five different intensity levels to the SQuAD dataset. To quantify robustness, we use robustness metrics that provide a standardized measure of model performance across noise types and levels. Experiments on transformer-based question-answering models reveal robustness vulnerabilities and yield important insights into model performance on realistic textual input.
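To make the idea of context-level noise at graded intensities concrete, the following is a minimal sketch of two of the perturbations named in the TL;DR, Character Deletion and Word Reordering. The source does not specify how the noise is implemented, so the function names (`delete_chars`, `reorder_words`, `perturb_context`), the per-character and per-pair probabilities, and the intensity grid `INTENSITIES` are all illustrative assumptions rather than the authors' procedure.

```python
import random

# Hypothetical intensity grid: the paper uses five levels, but their
# actual values are not given in this summary.
INTENSITIES = [0.1, 0.2, 0.3, 0.4, 0.5]


def delete_chars(text: str, rate: float, rng: random.Random) -> str:
    """Drop each non-whitespace character independently with probability `rate`."""
    return "".join(ch for ch in text if ch.isspace() or rng.random() >= rate)


def reorder_words(text: str, rate: float, rng: random.Random) -> str:
    """Swap adjacent word pairs, each with probability `rate`."""
    words = text.split()
    i = 0
    while i < len(words) - 1:
        if rng.random() < rate:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 2  # skip past the swapped pair so a word moves at most once
        else:
            i += 1
    return " ".join(words)


def perturb_context(context: str, noise_fn, seed: int = 0) -> dict:
    """Return one perturbed copy of `context` per assumed intensity level."""
    return {
        rate: noise_fn(context, rate, random.Random(seed))
        for rate in INTENSITIES
    }


if __name__ == "__main__":
    context = "The Normans gave their name to Normandy, a region in France."
    for rate, noisy in perturb_context(context, delete_chars).items():
        print(f"{rate:.1f}: {noisy}")
```

Only the context is perturbed here; the question is left untouched, matching the setup described in the abstract. Seeding the generator per call keeps each perturbed copy reproducible, which matters when the same noisy contexts are fed to several models for comparison.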

Paper Structure

This paper contains 26 sections, 12 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Overview of the Robustness Evaluation Framework for QA Models. The figure illustrates the process of adding adversarial noise to the context from the SQuAD dataset and feeding the perturbed context into QA models for inference. The predicted answers are then evaluated using contextual robustness metrics such as Accuracy, Robustness Index, Error Rate, and Noise Impact Factor.
  • Figure 2: Scatter plots showing the pairwise relationships between metrics under the "Char Del" (Character Deletion) noise: Robustness Index vs. Error Rate, Error Rate vs. Noise Impact Factor (NIF), and Robustness Index vs. NIF.
  • Figure 3: Performance of DeBERTa under Different Noise Types and Levels.