How LLMs Comprehend Temporal Meaning in Narratives: A Case Study in Cognitive Evaluation of LLMs

Karin de Langis, Jong Inn Park, Andreas Schramm, Bin Hu, Khanh Chi Le, Michael Mensink, Ahn Thu Tong, Dongyeop Kang

TL;DR

This work probes whether large language models genuinely understand the temporal semantics of narrative aspect or merely pattern-match. By deploying an Expert-in-the-Loop pipeline across truth-value judgments, word completion, and open-ended causal queries, the study reveals consistent prototypicality biases, inconsistent imperfective processing, and limited distal causal reasoning compared with humans. It demonstrates a robust, standardized experimental framework for evaluating cognitive-linguistic capabilities in LLMs, including prompt perturbations and cross-model analyses. The findings highlight fundamental gaps between human-like narrative comprehension and current LLM capabilities, underscoring the need for more cognitively informed models and rigorous evaluation protocols.

Abstract

Large language models (LLMs) exhibit increasingly sophisticated linguistic capabilities, yet the extent to which these behaviors reflect human-like cognition versus advanced pattern recognition remains an open question. In this study, we investigate how LLMs process the temporal meaning of linguistic aspect in narratives that were previously used in human studies. Using an Expert-in-the-Loop probing pipeline, we conduct a series of targeted experiments to assess whether LLMs construct semantic representations and pragmatic inferences in a human-like manner. Our findings show that LLMs over-rely on prototypicality, produce inconsistent aspectual judgments, and struggle with causal reasoning derived from aspect, raising concerns about their ability to fully comprehend narratives. These results suggest that LLMs process aspect fundamentally differently from humans and lack robust narrative understanding. Beyond these empirical findings, we develop a standardized experimental framework for the reliable assessment of LLMs' cognitive and linguistic capabilities.

Paper Structure

This paper contains 38 sections, 7 figures, 6 tables.

Figures (7)

  • Figure 1: We examine how LLMs understand differences in aspect by presenting LLMs with narratives that have a key word either in the imperfective or perfective (e.g., "was passing" vs. "passed") followed by comprehension probes adapted from previous human studies.
  • Figure 2: A conceptual overview of our Expert-in-the-Loop probing pipeline for assessing cognitive abilities of LLMs, designed in close collaboration with domain experts from cognitive science.
  • Figure 3: Accuracy of LLMs' semantic truth-value judgments for events marked with imperfective aspect, when the events are embedded within a narrative (shaded bars) versus presented without one. For imperfective events, LLMs have much lower accuracy rates than humans when judging whether the event's resulting final state is valid. Further, LLMs seem to be heavily affected by the presence or absence of a narrative, especially when judging the negative polarity of final states. Notably, the presence or absence of a narrative changes responses in inconsistent directions. Error bars represent the standard error.
  • Figure 4: Frequencies at which word completions match the target word from Cause 1, across models. Shaded bars indicate imperfective aspect in Cause 1. LLM completions match the target significantly more often when the probe directly follows Cause 1 (top), and match frequencies are reduced when the probe comes after the effect (bottom).
  • Figure 5: As LLM parameter size increases, there is a trend towards more human-like causal inferences with respect to the Cause 1 event when Cause 1 is in the imperfective. When Cause 1 is in the perfective, LLMs are consistently below human causal inference rates.
  • ...and 2 more figures
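
The Figure 3 caption reports accuracy with standard-error bars. As a rough illustration of what such a bar could represent, the sketch below computes an accuracy and the standard error of a proportion from a list of per-item correctness flags. The function name `accuracy_with_se` and the binomial formula sqrt(p(1-p)/n) are assumptions made for illustration; the source does not say how the authors computed their error bars.

```python
import math
from typing import Sequence, Tuple


def accuracy_with_se(correct: Sequence[bool]) -> Tuple[float, float]:
    """Return (accuracy, standard error) for a list of per-item judgments.

    Hypothetical helper: assumes each judgment is an independent
    correct/incorrect outcome, so the standard error of the proportion
    is sqrt(p * (1 - p) / n). The paper may use a different estimator.
    """
    n = len(correct)
    if n == 0:
        raise ValueError("need at least one judgment")
    p = sum(correct) / n  # fraction of judgments marked correct
    se = math.sqrt(p * (1 - p) / n)
    return p, se


if __name__ == "__main__":
    # Made-up flags, not data from the paper.
    judgments = [True, True, False, False]
    acc, se = accuracy_with_se(judgments)
    print(f"accuracy={acc:.2f}, standard error={se:.2f}")
```

Under these assumptions, one such (accuracy, standard error) pair per model and condition (narrative versus no narrative) would give one bar and its error bar.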