Beyond Leaderboards: A survey of methods for revealing weaknesses in Natural Language Inference data and models

Viktor Schlegel, Goran Nenadic, Riza Batista-Navarro

TL;DR

This survey addresses the problem that high leaderboard scores in natural language inference can stem from superficial data cues rather than genuine reasoning. It synthesises a taxonomy of methods for revealing weaknesses, spanning data-focused analyses, model-focused evaluations, and model-improvement strategies across RTE and MRC tasks. Key findings show pervasive spurious correlations, cue exploitation by models, and limited generalisation, even as robustness techniques like data augmentation and adversarial evaluation prove beneficial. The work provides a structured framework and actionable recommendations for improving dataset design and model robustness, and for steering future research beyond single-dataset performance benchmarks.

Abstract

Recent years have seen a growing number of publications that analyse Natural Language Inference (NLI) datasets for superficial cues, asking whether such cues undermine the complexity of the tasks underlying those datasets and how they affect the models that are optimised and evaluated on this data. This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language. We summarise and discuss the findings and conclude with a set of recommendations for possible future research directions. We hope it will be a useful resource both for researchers who propose new datasets, giving them a set of tools to assess the suitability and quality of their data for evaluating various phenomena of interest, and for those who develop novel architectures, helping them to further understand the implications of their improvements with respect to their model's acquired capabilities.

Paper Structure

This paper contains 34 sections, 6 figures, 1 table.

Figures (6)

  • Figure 1: Number of premise-hypothesis pairs in an RTE dataset following lexical patterns, spuriously skewed towards Entailment (McCoy et al., 2019).
  • Figure 2: Models' over-stability towards common words in question and paragraph, revealed by adversarially inserting distracting sentences (Jia and Liang, 2017).
  • Figure 3: Taxonomy of investigated methods. Dashed arrows indicate conceptually related types of methods, i.e. a method of one type is commonly applied with another method of the related type. Labels (a), (b) and (c) correspond to the coarse grouping discussed in the methods section.
  • Figure 4: Number of methods per category split by task. As multiple papers report more than one method, the maximum (86) does not add up to the number of surveyed papers (69).
  • Figure 5: Datasets by publication year, with no or any spurious-correlation detection methods applied; applied in a later publication; created using adversarial filtering, or both.
  • ...and 1 more figure