Large-scale study of human memory for meaningful narratives

Antonios Georgiou, Tankut Can, Mikhail Katkov, Misha Tsodyks

TL;DR

This work develops a pipeline that uses large language models both to design naturalistic narrative stimuli for large-scale recall and recognition memory experiments and to analyze the results. It also constructs a simple per-clause measure, based on semantic similarity to the whole narrative, that correlates strongly with recall probability.

Abstract

The statistical study of human memory requires large-scale experiments involving many stimulus conditions and test subjects. While this approach has proven quite fruitful for meaningless material such as random lists of words, naturalistic stimuli, like narratives, have until now resisted such large-scale study because of the quantity of manual labor required to design and analyze such experiments. In this work, we develop a pipeline that uses large language models (LLMs) both to design naturalistic narrative stimuli for large-scale recall and recognition memory experiments and to analyze the results. We performed online memory experiments with a large number of participants and collected recognition and recall data for narratives of different sizes. We found that both recall and recognition performance scale linearly with narrative length; however, for longer narratives people tend to summarize the content rather than recall precise details. To investigate the role of narrative comprehension in memory, we repeated these experiments using scrambled versions of the narratives. Although recall performance declined significantly, recognition remained largely unaffected. Recalls in this condition seem to follow the original narrative order rather than the actual scrambled presentation, pointing to a contextual reconstruction of the story in memory. Finally, using LLM text embeddings, we construct a simple measure for each clause, based on semantic similarity to the whole narrative, that shows a strong correlation with recall probability. Overall, our work demonstrates the power of LLMs in accessing new regimes in the study of human memory and suggests novel psychologically informed benchmarks for LLM performance.
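
The per-clause similarity measure mentioned at the end of the abstract can be illustrated with a minimal sketch. It assumes that each clause and the whole narrative have already been mapped to embedding vectors, and that the score is the cosine similarity between the two; the abstract does not spell out how the narrative-level vector is obtained, so that part is an assumption. The tiny 3-dimensional vectors and the recall probabilities below are made up for illustration and are not data from the study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy embeddings (real embeddings have many more dimensions).
narrative_embedding = [1.0, 0.0, 0.0]
clause_embeddings = [
    [1.0, 0.0, 0.0],  # clause aligned with the narrative as a whole
    [1.0, 1.0, 0.0],  # clause partially aligned
    [0.0, 1.0, 0.0],  # clause orthogonal to the narrative
]
# Hypothetical recall probabilities, one per clause.
p_rec = [0.9, 0.6, 0.2]

# One similarity score per clause, then its correlation with recall.
similarity = [cosine_similarity(c, narrative_embedding) for c in clause_embeddings]
r = pearson_r(similarity, p_rec)
print(similarity, r)
```

In this toy setup the clause most similar to the narrative vector is also the one recalled most often, so the correlation is strongly positive; with real data the relation would of course be noisier.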

Paper Structure (31 sections, 5 equations, 14 figures, 4 tables)

Figures (14)

  • Figure 1: Reliability of LLM scoring of recalls. Three authors and GPT-4 (OpenAI API model gpt-4-0613) scored 30 recalls by answering, for each clause, whether the information in that clause is present in each individual recall. (A): Comparison between recall probabilities $P_{rec}$ for each clause, as calculated from GPT-4 scores (orange) and average human scores (dashed blue). The full range of human scores is given by the shaded blue region. (B): Strong correlation between human and GPT-4 scores across clauses, with a correlation coefficient ($r$-value) of 0.94. (C): Correlations between individual human scores show strong overall agreement between human scorers.
  • Figure 2: Human performance in recall and recognition experiments for narratives of different lengths. (A): Estimated number of retained clauses ($M$), measured in the recognition experiment, plotted as a function of the number of clauses in the narrative ($L$). Surprisingly, $M$ takes similar values for intact and scrambled narratives. (B): Average number of recalled clauses ($R$) for narratives of different lengths. In contrast to $M$, $R$ drops substantially for scrambled narratives. Also plotted is the average number of clauses used in the recall (green crosses), which dips substantially below $R$ for longer narratives, indicating the tendency of subjects to summarize. (C): Average number of recalled clauses vs. number of retained clauses from the same story. As expected from panels (A) and (B), the number of retrieved clauses in scrambled narratives is substantially smaller than in intact narratives for the same number of retained clauses. For comparison, we show the theoretical prediction for random lists of words, which was shown to describe the data well (Naim et al., 2020). More clauses are recalled from intact narratives than words from random lists. Surprisingly, retrieval of scrambled stories is significantly worse than that of random lists, suggesting an active suppression of items in service of generating a coherent recall (participants were implicitly instructed to recall the story). Finally, the mean number of clauses in a recall (green crosses) is insensitive to the content of the clauses and simply measures the length of a participant's recall. For $R$ we report the standard error in total recall length over the entire population of subjects; for $M$, we calculate the error using bootstrap.
  • Figure 3: Recall order. All panels show the color-coded order of clauses or words for different conditions. Recalled clauses or words are stacked vertically, with the first recalled item at the bottom of a column and the last at the top. The height of a column represents the total number of clauses or words recalled in a given trial. In panels (A), (B), and (D), the color code represents the serial position of clauses or words at presentation, from early (red) to late (blue). Panel (C) is the only exception: there the color code reflects the serial position of clauses in the original (intact) story. (A): Recall of coherent stories largely preserves presentation order. (B): Recall of random word lists does not preserve presentation order. (C): As with random lists, recall of a scrambled story does not preserve presentation order, but rather appears to reconstruct the original order of the story, as seen from the color gradients in panel (C). Random words and scrambled stories are thus recalled in random order with respect to their presentation order, but participants partially unscramble the scrambled stories: recalled clauses tend to follow the order of the original narrative. Participants evidently construct a mental representation of the scrambled narrative that is close to its original form, so recall reflects the original sequence of the clauses rather than the input sequence.
  • Figure 4: Recognition vs recall performance across different clauses. Clauses from all the narratives used in this study were divided evenly into 15 bins according to their $P_{rec}$, and the average $P_h$ for the clauses in each bin was computed and plotted against the center of the corresponding bin. Error bars show standard error within a bin. We show a linear fit (orange dashed) to the binned data (solid blue). The correlation coefficient $r$ is computed using the unbinned cloud of data points, with the $95\%$ confidence interval calculated using bootstrap with $3000$ samples.
  • Figure 5: Semantic similarity correlates with recall probability. (A): Scatter plot of recall probability $P_{rec}$ vs. cosine similarity score (described in text) for each clause in a narrative with $L = 19$ clauses. Plotted in orange is the mean $P_{rec}$ and standard error per bin, for 5 bins, with the horizontal coordinate taken at the midpoint of the bin. The correlation coefficient ($r$-value) is 0.8 and statistically significant ($p \ll 0.001$). (B): Same as (A) for a story of length $L = 32$, with a statistically significant correlation of $r = 0.58$. Error bars show the standard error within each bin. (C): Correlation coefficient between $P_{rec}$ and similarity scores computed for each story, plotted as a function of story length. The significance level is indicated in the legend: green triangles indicate $p < 0.001$ (***), blue squares $p < 0.01$ (**), and orange triangles $p < 0.05$ (*), while empty circles indicate no statistically significant correlation ($p > 0.05$). p-values are computed using a two-sided Wald test. $95\%$ confidence intervals are computed using bootstrap with $1000$ samples and are indicated with capped error bars. Text embeddings for this figure were obtained using OpenAI's text-embedding-3-small model.
  • ...and 9 more figures
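
Two quantities recur in the captions above: the per-clause recall probability $P_{rec}$ obtained from binary clause-by-recall scores (Figure 1), and a bootstrap $95\%$ confidence interval for a correlation coefficient (Figures 4 and 5). The following is a minimal sketch of both, not the authors' analysis code. It assumes a percentile bootstrap that resamples clauses with replacement, which the captions do not specify; the function names and all numbers are hypothetical.

```python
import math
import random


def recall_probability(scores):
    """scores[i][j] is 1 if recall i contains the information of clause j, else 0.
    Returns, for each clause, the fraction of recalls that contain it."""
    n_recalls = len(scores)
    n_clauses = len(scores[0])
    return [sum(row[j] for row in scores) / n_recalls for j in range(n_clauses)]


def pearson_r(xs, ys):
    """Pearson correlation; None if either sequence has no variation."""
    if len(set(xs)) < 2 or len(set(ys)) < 2:
        return None
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def bootstrap_ci(xs, ys, n_samples=3000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Pearson r (assumed variant)."""
    rng = random.Random(seed)
    n = len(xs)
    rs = []
    for _ in range(n_samples):
        # Resample (x, y) pairs with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson_r([xs[i] for i in idx], [ys[i] for i in idx])
        if r is not None:  # skip degenerate resamples
            rs.append(r)
    rs.sort()
    lo = rs[int((alpha / 2) * len(rs))]
    hi = rs[max(int((1 - alpha / 2) * len(rs)) - 1, 0)]
    return lo, hi


# Hypothetical scores: 3 recalls (rows) by 4 clauses (columns).
toy_scores = [[1, 1, 0, 0],
              [1, 0, 0, 1],
              [1, 1, 0, 0]]
toy_p_rec = recall_probability(toy_scores)

# Hypothetical per-clause recall and recognition probabilities.
p_rec = [0.9, 0.8, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
p_h = [0.95, 0.9, 0.7, 0.8, 0.6, 0.65, 0.5, 0.55]
r = pearson_r(p_rec, p_h)
ci_low, ci_high = bootstrap_ci(p_rec, p_h)
print(toy_p_rec, r, (ci_low, ci_high))
```

Resampling whole (recall, recognition) pairs keeps each clause's two measurements together, which is what a confidence interval on their correlation requires.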