Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths

Amani Maina-Kilaas, Roger Levy

Abstract

Digging-in effects, where disambiguation difficulty increases with longer ambiguous regions, have been cited as evidence for self-organized sentence processing, in which structural commitments strengthen over time. In contrast, surprisal theory predicts no such effect unless lengthening genuinely shifts statistical expectations, and neural language models appear to show the opposite pattern. Whether digging-in is a robust real-time phenomenon in human sentence processing -- or an artifact of wrap-up processes or methodological confounds -- remains unclear. We report two experiments on English NP/Z garden-path sentences using Maze and self-paced reading, comparing human behavior with predictions from an ensemble of large language models. We find no evidence for real-time digging-in effects. Critically, items with sentence-final versus nonfinal disambiguation show qualitatively different patterns: positive digging-in trends appear only sentence-finally, where wrap-up effects confound interpretation. Nonfinal items -- the cleaner test of real-time processing -- show reverse trends consistent with neural model predictions.

Paper Structure (22 sections, 5 figures, 1 table)

Figures (5)

  • Figure 1: Empirical response times for Experiment 1 (Maze), split by sentence-finality. Top panels show mean word RT by sentence region (omitting regions not in all conditions). Bottom panels show the mean critical word RT, with the right-most averaging RTs within a plausible spillover region. Error bars reflect 95% confidence intervals around by-item means.
  • Figure 2: Mean garden-path effect in Experiment 1 (Maze), split by sentence-finality. Top row shows empirical data, middle shows predicted data; bottom shows surprisal for reference. Error bars reflect 95% confidence intervals around by-item means, but readers should rely on the mixed-effects models for assessing significance due to better variance attribution.
  • Figure 3: Empirical response times for Experiment 2 (SPR), split by sentence-finality. Top panels show mean word RT by sentence region (omitting regions not in all conditions). Bottom panels show the mean critical word RT, with the right-most averaging RTs within a plausible spillover region. Error bars reflect 95% confidence intervals around by-item means.
  • Figure 4: Mean garden-path effect in Experiment 2 (SPR), split by sentence-finality. Top row shows empirical data, middle shows predicted data; bottom shows surprisal for reference. Error bars reflect 95% confidence intervals around by-item means, but readers should rely on the mixed-effects models for assessing significance due to better variance attribution.
  • Figure 5: Empirical vs. LLM-predicted response times in critical items. LLMs underpredict difficulty in disambiguating regions while accurately estimating response times in other sentence regions.
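
The error bars described in the captions above (95% confidence intervals around by-item means) can be illustrated with a minimal sketch. It rests on several assumptions not stated in the text: the garden-path effect is taken to be the ambiguous-minus-unambiguous difference in critical-word RT, the interval uses a normal approximation (z = 1.96) rather than a t-based or bootstrap interval, and the function names and the `(item_id, rt_ms)` trial format are hypothetical. The paper's own analysis relies on mixed-effects models, which this sketch does not attempt to reproduce.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def by_item_means(trials):
    """Average RTs within each item.

    trials: iterable of (item_id, rt_ms) pairs (hypothetical format).
    Returns a dict mapping item_id to that item's mean RT.
    """
    grouped = defaultdict(list)
    for item_id, rt in trials:
        grouped[item_id].append(rt)
    return {item: mean(rts) for item, rts in grouped.items()}


def ci95(values, z=1.96):
    """Mean and normal-approximation 95% CI bounds (an assumption;
    a t-based or bootstrap interval may be more appropriate)."""
    m = mean(values)
    half_width = z * stdev(values) / sqrt(len(values))
    return m, m - half_width, m + half_width


def garden_path_effect(ambiguous, unambiguous):
    """By-item difference in mean RT (ambiguous minus unambiguous),
    summarized as a mean with a 95% CI. Only items present in both
    conditions contribute."""
    amb = by_item_means(ambiguous)
    unamb = by_item_means(unambiguous)
    shared = sorted(amb.keys() & unamb.keys())
    diffs = [amb[i] - unamb[i] for i in shared]
    return ci95(diffs)
```

Called as `garden_path_effect(ambiguous_trials, unambiguous_trials)`, the function returns a `(mean, lower, upper)` tuple in the same units as the input RTs. Averaging within items first means each item contributes a single value to the interval, which matches the "by-item means" wording in the captions but ignores participant-level variance.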