A Psycholinguistic Evaluation of Language Models' Sensitivity to Argument Roles

Eun-Kyoung Rosa Lee, Sathvik Nair, Naomi Feldman

Abstract

We present a systematic evaluation of large language models' sensitivity to argument roles, i.e., who did what to whom, by replicating psycholinguistic studies on human argument role processing. In three experiments, we find that language models are able to distinguish verbs that appear in plausible and implausible contexts, where plausibility is determined through the relation between the verb and its preceding arguments. However, none of the models capture the same selective patterns that human comprehenders exhibit during real-time verb prediction. This indicates that language models' capacity to detect verb plausibility does not arise from the same mechanism that underlies human real-time sentence processing.

Paper Structure

This paper contains 16 sections, 2 equations, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Surprisal effects plotted by condition and model. Higher values indicate greater role-sensitivity.
  • Figure 2: Classification accuracies for probes trained to distinguish plausible and implausible verbs under different conditions. Highlighted areas indicate standard errors of the mean across the 10 cross-validation folds. Dotted lines indicate at-chance accuracy.
  • Figure 3: Surprisal effects for control items plotted by condition and model. Compare to change-verb for Kim & Osterhout, swap-arguments and replace-argument for Chow et al.
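
The figure captions above refer to surprisal effects. As a rough illustration, surprisal is conventionally the negative log probability a model assigns to a word given its context, and a surprisal effect can be read as the difference in verb surprisal between an implausible and a plausible condition. The sketch below assumes that definition; the function names and the probabilities are hypothetical and are not taken from the paper.

```python
import math


def surprisal(prob: float) -> float:
    """Surprisal in bits of a word the model assigns probability `prob`."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    return -math.log2(prob)


def surprisal_effect(p_plausible: float, p_implausible: float) -> float:
    """Assumed definition: implausible-verb surprisal minus plausible-verb surprisal.

    A positive value means the model finds the verb less expected in the
    implausible context, i.e. it is sensitive to the manipulation.
    """
    return surprisal(p_implausible) - surprisal(p_plausible)


# Made-up probabilities for the same verb in two contexts (illustration only).
effect = surprisal_effect(p_plausible=0.25, p_implausible=0.0625)
print(f"surprisal effect: {effect:.2f} bits")
```

Under this reading, a larger positive effect corresponds to the "greater role-sensitivity" mentioned in the Figure 1 caption, while an effect near zero would mean the model treats both contexts alike.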