Breaking NLI Systems with Sentences that Require Simple Lexical Inferences

Max Glockner, Vered Shwartz, Yoav Goldberg

TL;DR

The paper reveals that state-of-the-art NLI models trained on SNLI struggle to perform simple lexical inferences that rely on lexical and world knowledge. It constructs an adversarial test set by minimally altering training sentences to probe hypernymy, co-hyponymy, and related relations, while controlling vocabulary. Evaluations show substantial accuracy drops across models, even with additional data, though a WordNet-enhanced model (KIM) fares best, highlighting the role of external lexical knowledge. The study argues for incorporating lexical knowledge into learning and provides a practical benchmark for evaluating lexical inference abilities in NLI systems.

Abstract

We create a new NLI test set that shows the deficiency of state-of-the-art models in inferences that require lexical and world knowledge. The new examples are simpler than the SNLI test set, containing sentences that differ by at most one word from sentences in the training set. Yet, the performance on the new test set is substantially worse across systems trained on SNLI, demonstrating that these systems are limited in their generalization ability, failing to capture many simple inferences.
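The single-word substitution idea described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not the authors' pipeline: the toy lexicon, the relation-to-label mapping, and the function name `make_pairs` are hypothetical, and a real construction would draw premises from the SNLI training set and replacement words from a lexical resource.

```python
# Hypothetical toy lexicon: word -> list of (replacement word, lexical relation).
# A real setup would use a lexical resource; these entries are invented.
TOY_LEXICON = {
    "guitar": [("instrument", "hypernym"), ("violin", "co-hyponym")],
}

# Assumed mapping from the relation of the replacement word to an NLI label.
RELATION_TO_LABEL = {
    "hypernym": "entailment",       # a guitar is an instrument
    "co-hyponym": "contradiction",  # a guitar is not a violin
}


def make_pairs(premise):
    """Return (premise, hypothesis, label) triples in which the hypothesis
    differs from the premise by exactly one replaced word."""
    tokens = premise.split()
    pairs = []
    for i, token in enumerate(tokens):
        for replacement, relation in TOY_LEXICON.get(token.lower(), []):
            hypothesis = tokens[:i] + [replacement] + tokens[i + 1:]
            pairs.append((premise, " ".join(hypothesis), RELATION_TO_LABEL[relation]))
    return pairs


for premise, hypothesis, label in make_pairs("The man plays the guitar"):
    print(f"{premise} | {hypothesis} | {label}")
```

Each generated hypothesis stays within one word of its premise, so any drop in accuracy on such pairs points at the lexical relation rather than at sentence complexity.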

Paper Structure

This paper contains 21 sections, 4 tables.