To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times

Thomas Hikaru Clark, Carlos Arriaga, Javier Conde, Gonzalo Martínez, Pedro Reviriego

Abstract

Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms (such as valence, arousal, or concreteness) for words and multiword expressions that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. For other norms, such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that, via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM prompting as a proxy for human cognitive measures.

Paper Structure (24 sections, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Overview of our approach, contrasting zero-shot prompting with fine-tuning on small supervised datasets for predicting psycholinguistic norms.
  • Figure 2: The correlation between model predictions and ground truth norms for word memorability, across 3 model families and in 3 evaluation regimes. For comparison, the mean correlation with the predictions of interpretable baseline predictors is included.
  • Figure 3: The correlation between model predictions and ground truth norms for sentence memorability, across 3 model families and in 3 evaluation regimes. For comparison, the mean correlation with the predictions of interpretable baseline predictors is included.
  • Figure 4: The correlation between model predictions and ground truth norms for self-paced reading times (Natural Stories corpus), across 3 model families and in 3 evaluation regimes. For comparison, the mean correlation with the predictions of interpretable baseline predictors is included.
  • Figure 5: The correlation between model predictions and ground truth norms for eye-tracking reading times (OneStop corpus), across 3 model families and in 3 evaluation regimes. For comparison, the mean correlation with the predictions of interpretable baseline predictors is included.
  • ...and 1 more figure
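
The figure captions above describe the same comparison each time: correlate model predictions with ground-truth norms, then set that correlation against the one obtained by baseline predictors. A minimal sketch of such a comparison follows. It assumes Pearson correlation (the text here does not say which correlation measure the paper uses), and the names `pearson_r`, `human_norms`, `model_estimates`, and `baseline_estimates` are hypothetical, as are the toy values, which are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy values for five items (assumption: NOT data from the paper).
human_norms = [0.62, 0.71, 0.55, 0.80, 0.67]         # ground-truth norms
model_estimates = [0.60, 0.75, 0.50, 0.78, 0.70]     # e.g. LLM predictions
baseline_estimates = [0.66, 0.64, 0.65, 0.69, 0.63]  # e.g. a simple predictor

model_r = pearson_r(model_estimates, human_norms)
baseline_r = pearson_r(baseline_estimates, human_norms)
print(f"model r = {model_r:.3f}, baseline r = {baseline_r:.3f}")
```

With real data, the same function would be applied once per model family and evaluation regime, with predictions taken on held-out items in the fine-tuned case.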