Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds

Victoria Basmov, Yoav Goldberg, Reut Tsarfaty

TL;DR

This work interrogates whether large language models truly grasp simple linguistic entailments that humans find trivial. By constructing targeted NLI benchmarks across grammatically-specified entailments, monotonicity, evidential adverbs, presuppositions, and non-factive embeddings, the authors test zero-shot and chain-of-thought prompts on GPT-3.5, GPT-4, and LLaMA-2, with human baselines for context. Across tasks, GPT-4 remains the most capable but still substantially underperforms humans, and embeddings under presupposition triggers or non-factives systematically mislead models, revealing persistent blind spots in entailment semantics. The study also shows prompt design and chain-of-thought reasoning offer limited, inconsistent gains and underscores the need for principled benchmarks and new training paradigms to improve linguistic competence in LLMs.

Abstract

We evaluate LLMs' language understanding capacities on simple inference tasks that most humans find trivial. Specifically, we target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments. We design evaluation sets for these tasks and conduct experiments in both zero-shot and chain-of-thought setups, with multiple prompts and LLMs. The models exhibit moderate to low performance on these evaluation sets. Subsequent experiments show that embedding the premise in syntactic constructions that should preserve the entailment relations (presupposition triggers) or change them (non-factives) further confuses the models, causing them to either under-predict or over-predict certain entailment labels regardless of the true relation, often disregarding the nature of the embedding context. Overall, these results suggest that, despite LLMs' celebrated language understanding capacity, even the strongest models have blind spots with respect to certain types of entailments, and certain information-packaging structures act as "blinds" overshadowing the semantics of the embedded premise.
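
To make the embedding manipulation concrete, here is a minimal Python sketch of how a gold label could be derived when a premise is wrapped in an embedding frame. It is an illustration only, not the authors' code: the `NLIExample` class, the `embed` helper, the two frames, and the example sentences are all hypothetical, and the rule that a non-factive frame always yields "neutral" is an assumption based on the usual semantics of such verbs.

```python
from dataclasses import dataclass

LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # one of LABELS


def embed(example: NLIExample, frame: str, kind: str) -> NLIExample:
    """Wrap the premise in an embedding frame and derive the expected label.

    kind="presupposition": the frame presupposes the embedded clause, so the
        original label is kept.
    kind="non_factive": the frame does not commit to the embedded clause, so
        the expected label is assumed to become "neutral".
    """
    if kind not in ("presupposition", "non_factive"):
        raise ValueError(f"unknown embedding kind: {kind}")
    if example.label not in LABELS:
        raise ValueError(f"unknown label: {example.label}")

    # Naive lowercasing of the first character; this would be wrong for
    # premises that start with a proper noun.
    clause = example.premise[0].lower() + example.premise[1:]
    new_label = example.label if kind == "presupposition" else "neutral"
    return NLIExample(frame.format(clause=clause), example.hypothesis, new_label)


# Hypothetical example sentences and frames, not taken from the paper's data.
base = NLIExample("Her brother was singing.", "Someone was singing.", "entailment")
kept = embed(base, "He realizes that {clause}", "presupposition")
changed = embed(base, "He thinks that {clause}", "non_factive")
print(kept)
print(changed)
```

A model that ignores the embedding context would assign the same label to both derived examples, which is the "blinds" behavior described above.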

Paper Structure (47 sections, 2 figures, 7 tables)

Figures (2)

  • Figure 1: High-level summary of the experiments and results (reported numbers are for gpt-3.5-turbo-0301).
  • Figure 2: Instructions for human annotation. To avoid implicitly training annotators on the linguistic inferences considered in this work, we provided examples that demonstrate the meaning of the neutral/entailing/contradiction labels using other inference types not covered here.