Language models align with human judgments on key grammatical constructions

Jennifer Hu; Kyle Mahowald; Gary Lupyan; Anna Ivanova; Roger Levy

Abstract

Do large language models (LLMs) make human-like linguistic generalizations? Dentella et al. (2023) ("DGL") prompt several LLMs ("Is the following sentence grammatically correct in English?") to elicit grammaticality judgments of 80 English sentences, concluding that LLMs demonstrate a "yes-response bias" and a "failure to distinguish grammatical from ungrammatical sentences". We re-evaluate LLM performance using well-established practices and find that DGL's data in fact provide evidence for just how well LLMs capture human behaviors. Models not only achieve high accuracy overall, but also capture fine-grained variation in human linguistic judgments.

Paper Structure (2 figures, 1 table)

This paper contains 2 figures and 1 table.

Figures (2)

Figure 1: (a) Accuracy scores achieved by models on a version of DGL's original materials with minimal pairs. For each phenomenon, accuracy is computed as the proportion of items in that phenomenon where the model assigns higher probability to the grammatical version of that item (minimal pair) than to the ungrammatical version. (b) x-axis: Difference in sum surprisal (negative log probability) between the sentence presented to humans in DGL's experiments and its counterpart in the minimal pair. y-axis: Human acceptance rate (proportion judged as grammatical) for the presented sentence in each minimal pair. Each point represents a minimal pair test item.

Figure 2: (a) Participant-specific acceptance rates (i.e., rates of judging a sentence as grammatical) for sentences that DGL label as "grammatical" (x-axis) versus "ungrammatical" (y-axis). If participants' responses perfectly reflected DGL's normative coding, then all participants would be in the bottom right corner (as exemplified by Intrusive Resumption). (b) Confusion matrices for models and humans on each phenomenon, with models evaluated using the same prompt that was seen by humans ("Is the following sentence grammatically correct in English? [SENTENCE] Respond with C if it is correct, and N if it is not correct."). "Gram." = grammatical, and "Ungram." = ungrammatical. A small fraction of davinci2 and davinci3's responses (4%) were not codeable as corresponding to "C" or "N", resulting in missing data.
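
The two quantities described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that per-token log probabilities for each sentence have already been obtained from a model. The function names, the toy numbers, and the sign convention for the difference (presented sentence minus counterpart) are illustrative assumptions, not taken from the paper.

```python
def sum_surprisal(token_logprobs):
    """Sum surprisal of a sentence: the negative of its total log probability."""
    return -sum(token_logprobs)


def minimal_pair_accuracy(pairs):
    """Proportion of minimal pairs where the grammatical version is more probable.

    `pairs` is a list of (grammatical_logprobs, ungrammatical_logprobs) tuples,
    each element being a list of per-token log probabilities.
    """
    correct = sum(1 for gram, ungram in pairs if sum(gram) > sum(ungram))
    return correct / len(pairs)


def surprisal_difference(presented_logprobs, counterpart_logprobs):
    """Sum surprisal of the presented sentence minus that of its counterpart.

    The direction of subtraction is an assumption made for this sketch.
    """
    return sum_surprisal(presented_logprobs) - sum_surprisal(counterpart_logprobs)


# Hypothetical log probabilities for two minimal pairs (not real model output).
toy_pairs = [
    ([-1.0, -2.0, -0.5], [-1.0, -2.0, -3.0]),  # grammatical version more probable
    ([-0.5, -4.0], [-0.5, -1.5]),              # ungrammatical version more probable
]

accuracy = minimal_pair_accuracy(toy_pairs)
difference = surprisal_difference(*toy_pairs[0])
print(accuracy, difference)
```

Because the comparison is made within a minimal pair, the measure does not depend on the model producing a "yes" or "no" answer, only on the relative probability of the two sentences. The base of the logarithm does not affect accuracy, but it does scale the surprisal difference.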