Comparing Human and Language Models Sentence Processing Difficulties on Complex Structures

Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant

TL;DR

The paper addresses whether large language models process sentences in ways that resemble human parsing by directly comparing human and LLM comprehension across seven challenging sentence structures. It uses a unified task and a broad set of 31 models from five families, applying few-shot prompting and, in some cases, thinking-enabled prompts to measure performance on target versus baseline sentences. Key findings show that although LLMs generally outperform humans, garden-path sentences remain particularly difficult for many models, and the degree of alignment with human performance improves with model size; a sweet-spot phenomenon emerges wherein intermediate-sized models best capture human-like directionality. The study provides nuanced insights into the similarities and divergences between human and LLM sentence processing, suggesting that working-memory demands and the ability to discard initial misinterpretations influence where LLMs align with or diverge from human cognition, with implications for model design and evaluation.

Abstract

Large language models (LLMs) that fluently converse with humans are a reality - but do LLMs experience human-like processing difficulties? We systematically compare human and LLM sentence comprehension across seven challenging linguistic structures. We collect sentence comprehension data from humans and five families of state-of-the-art LLMs, varying in size and training procedure, in a unified experimental framework. Our results show that LLMs overall struggle on the target structures, but especially on garden-path (GP) sentences. Indeed, while the strongest models achieve near-perfect accuracy on non-GP structures (93.7% for GPT-5), they struggle on GP structures (46.8% for GPT-5). Additionally, when ranking structures based on average performance, the rank correlation between humans and models increases with parameter count. For each target structure, we also collect data for its matched baseline without the difficult structure. Comparing performance on the target vs. baseline sentences, the performance gap observed in humans holds for LLMs, with two exceptions: for models that are too weak, performance is uniformly low across both sentence types, and for models that are too strong, performance is uniformly high. Together, these results reveal convergence and divergence in human and LLM sentence comprehension, offering new insights into the similarity of humans and LLMs.
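
The rank-correlation comparison mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the correlation is a Spearman coefficient computed over per-structure mean accuracies; the function names and the accuracy values are hypothetical placeholders, not the paper's code or data.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n in ascending order, giving ties their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder accuracies for seven structures (illustrative only, NOT from the paper).
human_acc = [0.42, 0.55, 0.61, 0.70, 0.78, 0.83, 0.90]
model_acc = [0.40, 0.65, 0.60, 0.88, 0.91, 0.95, 0.97]

rho = spearman(human_acc, model_acc)
print(f"Spearman rho between human and model difficulty rankings: {rho:.3f}")
```

A coefficient near 1 would mean the model finds the same structures hard that humans do, regardless of how far apart the absolute accuracies are.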

Paper Structure

This paper contains 34 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: Absolute difference between LLM and human accuracy on the difficult condition.
  • Figure 2: Spearman correlation between humans and LLMs on ranking the difficulty of different structures.
  • Figure 3: Violation rate per structure.
  • Figure 4: Example of the first system prompt.
  • Figure 5: Example of the second system prompt.