Large Language Models for Psycholinguistic Plausibility Pretesting

Samuel Joseph Amouyal, Aya Meltzer-Asscher, Jonathan Berant

TL;DR

This work investigates whether large language models (LLMs) can generate plausibility judgments for psycholinguistic pretesting, potentially reducing the cost and time of materials pretests. It evaluates GPT-4 and several open-source LMs against human judgments on four datasets covering varied syntactic structures, using prompts with global and dataset-specific exemplars and a protocol of 20 ratings per sentence. GPT-4 correlates highly with human judgments across structures, which makes coarse-grained pretesting effective, but fine-grained discrimination remains challenging even for GPT-4. The study also provides a method for mapping LM judgments to human judgments and for filtering materials via recall-precision analysis. It highlights practical cost savings, proposes a workflow for LM-assisted pretesting, and outlines limitations and directions for prompt design, calibration, and applicability to low-resource languages.

Abstract

In psycholinguistics, the creation of controlled materials is crucial to ensure that research outcomes are solely attributed to the intended manipulations and not influenced by extraneous factors. To achieve this, psycholinguists typically pretest linguistic materials, where a common pretest is to solicit plausibility judgments from human evaluators on specific sentences. In this work, we investigate whether Language Models (LMs) can be used to generate these plausibility judgments. We investigate a wide range of LMs across multiple linguistic structures and evaluate whether their plausibility judgments correlate with human judgments. We find that GPT-4 plausibility judgments correlate highly with human judgments across the structures we examine, whereas other LMs correlate well with humans on commonly used syntactic structures. We then test whether this correlation implies that LMs can be used instead of humans for pretesting. We find that when coarse-grained plausibility judgments are needed, this works well, but when fine-grained judgments are necessary, even GPT-4 does not provide satisfactory discriminative power.
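
The correlation analysis described in the abstract can be illustrated with a small sketch: repeated ratings are averaged per sentence, the LM means are correlated with the human means, and a least-squares line maps LM means onto the human scale. This is only an illustration under stated assumptions, not the authors' code. The function names, the 1-7 rating scale, and the toy ratings below are all hypothetical.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept for predicting ys from xs."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Hypothetical toy data: one inner list of repeated ratings per sentence
# (an assumed 1-7 scale; a real run would use more ratings per sentence).
lm_ratings = [[6, 7, 6], [2, 1, 2], [4, 5, 4], [3, 3, 2]]
human_ratings = [[7, 7, 6], [2, 2, 3], [4, 4, 3], [3, 2, 2]]

# Average the repeated ratings so each sentence has one LM and one human score.
lm_means = [mean(r) for r in lm_ratings]
human_means = [mean(r) for r in human_ratings]

r = pearson(lm_means, human_means)
slope, intercept = fit_line(lm_means, human_means)

# Map each LM mean onto the human scale with the fitted line.
predicted_human = [slope * x + intercept for x in lm_means]
print(f"Pearson r = {r:.3f}, slope = {slope:.3f}, intercept = {intercept:.3f}")
```

A high correlation on its own does not guarantee that the LM can separate sentences with similar human ratings, which is the fine-grained limitation the abstract reports.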

Paper Structure

This paper contains 27 sections, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: Correlation between average human plausibility ratings and average LLM plausibility ratings across four pretesting datasets, along with the fitted linear regression and Pearson correlation. We plot the LLM with the highest correlation (GPT-4 in all cases, except for the bottom right where GPT-3.5 is shown).
  • Figure 2: A breakdown of the correlation for the specific prompt for a subset of the models.
  • Figure 3: The correlation of the model that uses the specific prompt when examples are included (full bar) versus when they are excluded (hatched bar).
  • Figure 4: Recall-precision curve when filtering out implausible sentences. Blue is for the specific prompt, red is for the global prompt. For a few points, we also mark the threshold value that yields that recall-precision result. For Chow et al. and Huang et al. we reach very high precision while keeping a large fraction of the sentences. For Rich et al. we can keep roughly half the sentences with a precision of 0.8-0.9.
  • Figure 5: Recall-precision curve when filtering out plausible sentences. Blue is for the specific prompt, red is for the global prompt. We also mark for a few points the threshold value that results in a particular recall-precision result. In both setups, we can obtain very high precision while keeping most of the sentences.
  • ...and 7 more figures
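
The recall-precision curves in Figures 4 and 5 come from sweeping a threshold over LM ratings. A minimal sketch of that idea for the "filter out implausible sentences" case follows. The definitions here are assumptions rather than the paper's exact procedure: a sentence is kept when its mean LM rating is at or above `lm_threshold`, and it counts as truly plausible when its mean human rating is at or above `human_cutoff`. The function name, the cutoff values, and the toy data are hypothetical.

```python
def precision_recall(lm_means, human_means, lm_threshold, human_cutoff):
    """Precision and recall of keeping sentences whose LM mean >= lm_threshold.

    Precision: share of kept sentences that humans also rate as plausible.
    Recall: share of human-plausible sentences that survive the filter.
    """
    kept = [h for l, h in zip(lm_means, human_means) if l >= lm_threshold]
    true_pos = sum(h >= human_cutoff for h in kept)
    plausible_total = sum(h >= human_cutoff for h in human_means)
    # Convention (an assumption): precision is 1.0 when nothing is kept.
    precision = true_pos / len(kept) if kept else 1.0
    recall = true_pos / plausible_total if plausible_total else 0.0
    return precision, recall


# Hypothetical per-sentence mean ratings on an assumed 1-7 scale.
lm_means = [6.5, 5.8, 4.9, 3.2, 2.1, 5.1]
human_means = [6.7, 6.0, 3.5, 2.8, 1.9, 5.5]

# Sweep a few thresholds to trace points on the recall-precision curve.
for threshold in (3, 5, 6):
    p, r = precision_recall(lm_means, human_means, threshold, human_cutoff=5)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

Raising the threshold trades recall for precision: a stricter filter keeps fewer sentences, but a larger share of those kept match the human judgment.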