Stress-Testing Neural Models of Natural Language Inference with Multiply-Quantified Sentences

Atticus Geiger, Ignacio Cases, Lauri Karttunen, Christopher Potts

TL;DR

The paper addresses a gap in evaluating the semantic fidelity of neural NLI models by generating synthetic multiply-quantified sentences whose semantic complexity can be controlled. It proposes a formal grammar, translates the generated sentences to first-order logic, and uses Prover9 to label premise-hypothesis pairs, enabling precise diagnostics. It finds that only the CompTreeNN with word-by-word alignment preserves lexical identity and achieves perfect accuracy, while the other architectures suffer systematic errors due to an information bottleneck. The work demonstrates the value of architecture-aware data generation for diagnosing semantic learning and suggests directions for building more robust models.
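
The grammar-to-logic step described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny vocabulary, the fixed subject-wide-scope reading, and the helper names `quantify`, `to_fol`, and `generate_sentences` are hypothetical and are not taken from the paper, whose grammar is richer. The output strings follow Prover9-style formula syntax (`all`, `exists`, `->`, `&`, `-`); the labeling step that would pass premise and hypothesis formulas to the prover is not shown.

```python
import itertools

# Hypothetical toy vocabulary (an assumption; not the paper's grammar).
QUANTIFIERS = ["every", "some", "no"]
NOUNS = ["dog", "cat"]
VERBS = ["chases"]


def quantify(quantifier, var, noun, body):
    """Wrap `body` in a restricted quantifier over `var`, Prover9-style."""
    if quantifier == "every":
        return f"(all {var} ({noun}({var}) -> {body}))"
    if quantifier == "some":
        return f"(exists {var} ({noun}({var}) & {body}))"
    if quantifier == "no":
        return f"-(exists {var} ({noun}({var}) & {body}))"
    raise ValueError(f"unknown quantifier: {quantifier}")


def to_fol(q_subj, n_subj, verb, q_obj, n_obj):
    """Translate '<q_subj> <n_subj> <verb> <q_obj> <n_obj>' to a formula.

    Assumes the subject quantifier takes wide scope over the object.
    """
    object_part = quantify(q_obj, "y", n_obj, f"{verb}(x,y)")
    return quantify(q_subj, "x", n_subj, object_part)


def generate_sentences():
    """Enumerate every doubly-quantified sentence the toy grammar allows."""
    return list(itertools.product(QUANTIFIERS, NOUNS, VERBS, QUANTIFIERS, NOUNS))


if __name__ == "__main__":
    for parts in generate_sentences()[:3]:
        print(" ".join(parts), "=>", to_fol(*parts))
```

Because every sentence is built from the grammar, its logical form is known by construction, which is what makes it possible to characterize the semantic complexity of each premise-hypothesis pair exactly.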

Abstract

Standard evaluations of deep learning models for semantics using naturalistic corpora are limited in what they can tell us about the fidelity of the learned representations, because the corpora rarely come with good measures of semantic complexity. To overcome this limitation, we present a method for generating data sets of multiply-quantified natural language inference (NLI) examples in which semantic complexity can be precisely characterized, and we use this method to show that a variety of common architectures for NLI inevitably fail to encode crucial information; only a model with forced lexical alignments avoids this damaging information loss.

Paper Structure

This paper contains 8 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: An aligned NLI example. At each non-terminal node, we show the full range of semantic relations that can hold between the two aligned phrases given the constraints we impose on these examples. We use the seven basic semantic relations of MacCartney (2009): $\#$ = independence; $\sqsubset$ = entailment; $\sqsupset$ = reverse entailment; $|$ = alternation; $\smile$ = cover; $\hat{} \:$ = negation; $\equiv$ = equivalence.
  • Figure 2: Model performance throughout training.
  • Figure 3: LSTM performance with different hyperparameters. The lower group makes decisions based only on quantifiers, negation, and empty words. The upper group partly overcomes that suboptimal solution.
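
The seven relations named in the Figure 1 caption can be made concrete with a small sketch. It assumes the usual set-theoretic reading of these relations, in which each phrase denotes a subset of a finite universe; the caption itself does not spell this reading out, so treat it as an assumption. The function name `relation` and the example sets are hypothetical.

```python
def relation(x, y, universe):
    """Classify the relation between sets x and y within `universe`.

    Assumed set-theoretic reading (not stated in the caption):
      equivalence         x == y
      entailment          x is a proper subset of y
      reverse entailment  x is a proper superset of y
      negation            x, y disjoint and together cover the universe
      alternation         x, y disjoint but do not cover the universe
      cover               x, y overlap and together cover the universe
      independence        none of the above
    The checks run in this order, so degenerate inputs (empty or
    universal sets) fall into the first matching case.
    """
    x, y, universe = frozenset(x), frozenset(y), frozenset(universe)
    if x == y:
        return "equivalence"
    if x < y:
        return "entailment"
    if x > y:
        return "reverse entailment"
    disjoint = not (x & y)
    exhaustive = (x | y) == universe
    if disjoint and exhaustive:
        return "negation"
    if disjoint:
        return "alternation"
    if exhaustive:
        return "cover"
    return "independence"


if __name__ == "__main__":
    U = {1, 2, 3, 4}
    print(relation({1}, {1, 2}, U))
    print(relation({1, 2}, {3, 4}, U))
```

Under this reading, exactly one of the seven labels applies to any pair of non-empty, non-universal sets, which is why a single label can be attached to each aligned node.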