How Hard is this Test Set? NLI Characterization by Exploiting Training Dynamics

Adrian Cosma, Stefan Ruseti, Mihai Dascalu, Cornelia Caragea

TL;DR

This research sorts the test sets of popular NLI datasets into three difficulty levels using methods that exploit training dynamics. The resulting split reduces measured spurious correlations and provides a more authentic evaluation of model performance.

Abstract

Natural Language Inference (NLI) evaluation is crucial for assessing language understanding models; however, popular datasets suffer from systematic spurious correlations that artificially inflate actual model performance. To address this, we propose a method for the automated creation of a challenging test set without relying on the manual construction of artificial and unrealistic examples. We categorize the test set of popular NLI datasets into three difficulty levels by leveraging methods that exploit training dynamics. This categorization significantly reduces spurious correlation measures, with examples labeled as having the highest difficulty showing markedly decreased performance and encompassing more realistic and diverse linguistic phenomena. When our characterization method is applied to the training set, models trained with only a fraction of the data achieve comparable performance to those trained on the full dataset, surpassing other dataset characterization techniques. Our research addresses limitations in NLI dataset construction, providing a more authentic evaluation of model performance with implications for diverse NLU applications.

Paper Structure

This paper contains 11 sections, 5 equations, 6 figures, 7 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overall diagram of our method to automatically construct a challenging test set for NLI.
  • Figure 2: Distributions of feature values across difficulty levels for the test set for SNLI (top), MultiNLI (middle), and FEVER (bottom). In addition to the features explored in Data Maps (Swayamdipta et al., 2020), we also incorporated the Average Margin (Pleiss et al., 2020) and included training dynamics from a model trained only on the hypothesis.
  • Figure 3: Distributions of the measures of spurious correlations for each level (easy, ambiguous, hard) across the three labels (entailment, neutral, contradiction) for SNLI (top), MultiNLI (middle) and FEVER (bottom).
  • Figure 4: Counts for each class in SNLI, MultiNLI, and FEVER, according to each difficulty level.
  • Figure 5: Comparison between the characterizations obtained by RoBERTa and DeBERTa on the "Contains Negation" heuristic measure.
  • ...and 1 more figures
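
The Figure 2 caption refers to Data Maps features and the Average Margin. As a rough illustration of what such training-dynamics features look like, the sketch below computes confidence, variability, correctness, and an average margin from per-epoch class probabilities. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `training_dynamics`, the input layout `(epochs, examples, classes)`, and the use of probabilities rather than logits for the margin are all choices made here for illustration. It also stops at the features and does not show how examples are assigned to the three difficulty levels.

```python
import numpy as np


def training_dynamics(probs, labels):
    """Compute per-example training-dynamics features.

    probs  : array of shape (epochs, examples, classes) holding the model's
             class probabilities recorded after each epoch (assumed layout).
    labels : array of shape (examples,) holding gold class indices.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(labels.shape[0])

    # Probability assigned to the gold label at each epoch: (epochs, examples).
    gold = probs[:, idx, labels]

    # Data Maps style features: mean and spread of the gold probability,
    # and the fraction of epochs in which the prediction was correct.
    confidence = gold.mean(axis=0)
    variability = gold.std(axis=0)
    correctness = (probs.argmax(axis=2) == labels).mean(axis=0)

    # Margin: gold score minus the largest score among the other classes,
    # averaged over epochs. Assumption: computed on probabilities here;
    # a logit-based margin would take the same form on raw scores.
    others = probs.copy()
    others[:, idx, labels] = -np.inf
    average_margin = (gold - others.max(axis=2)).mean(axis=0)

    return {
        "confidence": confidence,
        "variability": variability,
        "correctness": correctness,
        "average_margin": average_margin,
    }


# Toy usage: 2 epochs, 2 examples, 3 classes (made-up numbers).
toy_probs = [
    [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]],
    [[0.9, 0.05, 0.05], [0.3, 0.3, 0.4]],
]
toy_labels = [0, 2]
features = training_dynamics(toy_probs, toy_labels)
```

In this toy input, the first example is predicted correctly with high probability at every epoch, while the second is only predicted correctly in the final epoch, so it ends up with lower confidence and a negative average margin. The same function could be applied to the outputs of a hypothesis-only model to obtain the second set of dynamics mentioned in the caption.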