Unpacking the Resilience of SNLI Contradiction Examples to Attacks

Chetan Verma; Archit Agarwal

TL;DR

The paper probes how heavily SNLI-trained NLI models rely on spurious correlations by applying universal adversarial triggers to ELECTRA-small. It shows that the entailment and neutral classes are more vulnerable than contradiction, and that fine-tuning on a trigger-augmented dataset restores performance to near-baseline levels on both the standard and challenge sets. Adversarial triggers thus serve two purposes: exposing dataset biases and guiding debiasing, with practical implications for dataset design and model training. Combining adversarial probes with inoculation offers a practical route to more robust NLI models.

Abstract

Pre-trained models excel on NLI benchmarks like SNLI and MultiNLI, but their true language understanding remains uncertain. Models trained only on hypotheses and labels achieve high accuracy, indicating reliance on dataset biases and spurious correlations. To explore this issue, we applied the Universal Adversarial Attack to examine the model's vulnerabilities. Our analysis revealed substantial drops in accuracy for the entailment and neutral classes, whereas the contradiction class exhibited a smaller decline. Fine-tuning the model on an augmented dataset with adversarial examples restored its performance to near-baseline levels for both the standard and challenge sets. Our findings highlight the value of adversarial triggers in identifying spurious correlations and improving robustness while providing insights into the resilience of the contradiction class to adversarial attacks.
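The evaluation loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' code: `toy_predict` is a hypothetical stand-in for a trained model, the trigger word and the three examples are made up for the demonstration, and prepending the trigger to the hypothesis is an assumed placement.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str, str]  # (premise, hypothesis, gold label)


def add_trigger(example: Example, trigger: str) -> Example:
    """Prepend the trigger to the hypothesis (assumed placement); label unchanged."""
    premise, hypothesis, label = example
    return premise, f"{trigger} {hypothesis}", label


def per_class_accuracy(
    examples: List[Example], predict: Callable[[str, str], str]
) -> Dict[str, float]:
    """Accuracy computed separately for each gold label."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for premise, hypothesis, label in examples:
        total[label] += 1
        correct[label] += int(predict(premise, hypothesis) == label)
    return {label: correct[label] / total[label] for label in total}


def toy_predict(premise: str, hypothesis: str) -> str:
    """Hypothetical biased classifier that keys on surface cues in the hypothesis."""
    words = hypothesis.lower().replace(".", "").split()
    if "nobody" in words or "not" in words:
        return "contradiction"
    if "outside" in words:
        return "entailment"
    return "neutral"


# Made-up examples, for illustration only.
data: List[Example] = [
    ("A man plays guitar outside.", "A man is outside.", "entailment"),
    ("A man plays guitar outside.", "A man is not playing music.", "contradiction"),
    ("A man plays guitar outside.", "A man is famous.", "neutral"),
]

trigger = "nobody"  # hypothetical trigger token
triggered = [add_trigger(ex, trigger) for ex in data]

clean = per_class_accuracy(data, toy_predict)
attacked = per_class_accuracy(triggered, toy_predict)
drop = {label: clean[label] - attacked[label] for label in clean}

# Augmented set for inoculation: originals plus triggered copies with gold labels.
augmented = data + triggered
```

Here `drop` holds the per-class accuracy loss under the trigger, and `augmented` is the kind of dataset one would fine-tune on. A real experiment would replace `toy_predict` with the fine-tuned model and search for the trigger rather than hard-code it.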