Can language models handle recursively nested grammatical structures? A case study on comparing models and humans

Andrew Kyle Lampinen

TL;DR

The paper examines how to fairly compare language models with humans on recursively nested grammatical structures, highlighting how differences in evaluation can confound such comparisons. Given a simple few-shot prompt, large language models outperform the reported human results on challenging nested structures and even extrapolate to more deeply nested conditions than were tested with humans. A reanalysis of the prior human data suggests that humans may not perform above chance on the most difficult structures when they first encounter them, even after training, which points to evaluation differences rather than inherent model limits as a key factor. The work argues that foundation models call for different evaluation practices than cognitive models, and that careful, matched evaluation is needed for credible human-model comparisons. It also informs broader methodological debates in cognitive modeling and NLP.

Abstract

How should we compare the capabilities of language models (LMs) and humans? I draw inspiration from comparative psychology to highlight some challenges. In particular, I consider a case study: processing of recursively nested grammatical structures. Prior work suggests that LMs cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training, while the LMs were evaluated zero-shot. I therefore match the evaluation more closely. Providing large LMs with a simple prompt -- substantially less content than the human training -- allows the LMs to consistently outperform the human results, and even to extrapolate to more deeply nested conditions than were tested with humans. Further, reanalyzing the prior human data suggests that the humans may not perform above chance at the difficult structures initially. Thus, large LMs may indeed process recursively nested grammatical structures as reliably as humans. This case study highlights how discrepancies in the evaluation can confound comparisons of language models and humans. I therefore reflect on the broader challenge of comparing human and model capabilities, and highlight an important difference between evaluating cognitive models and foundation models.

Paper Structure (6 sections, 1 equation, 4 figures)

Figures (4)

  • Figure 1: Subject-verb agreement in nested, center-embedded sentences. (a) An example sentence with a three-layer nested structure. Grammatical dependencies are highlighted. (b) The task for the models is to choose the next word in the sentence; in this case, completing the inner dependency with a verb that matches whether the noun is plural or singular. Models (and humans) make relatively more errors on the inner dependency (green), particularly when it is singular and the other two nouns are plural. (Example based on the dataset of lakretz2021causal, as released in srivastava2022beyond.)
  • Figure 2: Error rates by prompt condition: Chinchilla performs well at the long embedded clauses when given a brief prompt. The plots show error rates (lower is better). Dashed lines show human performance in each condition, after 40 training trials, from lakretz2021mechanisms. (a) With no prompt, as in lakretz2021causal, Chinchilla performs poorly on two challenging conditions. (b) With a two-shot prompt, Chinchilla performs comparably to or better than humans in all conditions, and better than humans in the key PSP condition. (c) With eight shots, Chinchilla performs much better, consistently exhibiting error rates of less than 10% across all conditions. (Error bars are bootstrap 95% CIs.)
  • Figure 3: Chinchilla, with the same eight-shot prompt, evaluated on more challenging conditions. (a) The modifications to the tasks: either nesting the sentence more deeply (top), or inserting more center distractors (bottom). (b) Adding two more distractor plural prepositional phrases in the center does not substantially change the error rates. (c) Increasing the embedding depth by prepending an additional plural prefix does increase error rates in the most challenging condition; however, the model still performs better than humans do in easier conditions, indicated as dashed lines. (The dashed lines show human performance in the hardest conditions that lakretz2021mechanisms evaluated with humans; the conditions we evaluate the model on here are alterations intended to make the task even more difficult, especially in the PSP condition. Error bars are bootstrap 95% CIs.)
  • Figure 4: Human error rates on inner/embedded dependencies from lakretz2021mechanisms, reanalyzed to explore learning effects. Note that these plots are after the humans have completed their training phase. (a) Performance on all structures the first time they are encountered after training. Humans do not appear to perform better than chance on the most difficult structures. (b) Performance on the key PSP structure when it is first encountered, as a function of trial, a proxy for experience on related structures. (c) Performance on the PSP structure, with the target grammar violation, as a function of the number of times it has been encountered. The results are suggestive of a learning effect, but are not conclusive due to the small sample size. (Points/bars are aggregates across subjects; in panel (b), all subjects who first encountered that structure at that trial. Error bars in panel (a) are bootstrap 95% CIs; lines/ranges in panels (b)-(c) are logistic regression fits.)
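
Several captions above report error rates with bootstrap 95% CIs. The page does not say which bootstrap variant was used, so the following is only a minimal sketch that assumes a simple percentile bootstrap over per-trial error indicators (1 = wrong verb chosen, 0 = correct). The function name, parameters, and toy data are hypothetical and are not taken from the paper.

```python
import random


def bootstrap_error_rate_ci(errors, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for an error rate.

    `errors` is a list of 0/1 indicators, one per trial (assumed format).
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not errors:
        raise ValueError("errors must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(errors)

    # Resample trials with replacement and record each resample's error rate.
    means = []
    for _ in range(n_resamples):
        resample = [errors[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()

    # Read off the alpha/2 and 1 - alpha/2 percentiles, clamped to valid indices.
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    point = sum(errors) / n
    return point, means[lo_idx], means[hi_idx]


# Toy data (made up for illustration): 10 errors in 100 trials.
toy_errors = [1] * 10 + [0] * 90
point, lower, upper = bootstrap_error_rate_ci(toy_errors)
print(f"error rate = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

Resampling whole trials treats them as independent. If trials are grouped by subject or by sentence template, resampling at the group level would be the more cautious choice.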