Are they human? Detecting large language models by probing human memory constraints

Simon Schug, Brenden M. Lake

Abstract

The validity of online behavioral research relies on study participants being human rather than machine. In the past, it was possible to detect machines by posing simple challenges that were easily solved by humans but not by machines. General-purpose agents based on large language models (LLMs) can now solve many of these challenges, threatening the validity of online behavioral research. Here we explore the idea of detecting humanness by using tasks that machines can solve too well to be human. Specifically, we probe for the existence of an established human cognitive constraint: limited working memory capacity. We show that cognitive modeling on a standard serial recall task can be used to distinguish online participants from LLMs even when the latter are specifically instructed to mimic human working memory constraints. Our results demonstrate that it is viable to use well-established cognitive phenomena to distinguish LLMs from humans.

Figures (3)

  • Figure 1: Overview. Left: Probed recall working memory paradigm with two probe type conditions chosen uniformly at random within each trial. Center: We compare participants recruited online to a sample of large language models (LLMs) with varying system prompts and backbone models on the probed recall working memory task. Right: In contrast to humans, transformer-based LLMs have perfect contextual recall through the attention mechanism and need to emulate serial recall effects.
  • Figure 2: Working memory performance comparison on the probed recall task. Left: Box plot showing average accuracies by participant type. Each box spans from the first quartile to the third quartile, with the middle line denoting the median and whiskers reaching 1.5 times the interquartile range from the box. Accuracy differs markedly between online participants and LLMs. Despite being instructed to behave human-like, LLMs achieve near-perfect accuracy (LLM-Human). Average accuracy decreases only when the models are additionally instructed to display specific working memory limitations (LLM-WM) or finetuned on data from psychology experiments (Centaur). Center: Average accuracy as a function of serial position on trials with a set size of 12. Online participants show the established serial position effects, recalling items at the beginning (primacy) and end (recency) of the list more readily. LLMs show no such effect when instructed to behave human-like (LLM-Human) but qualitatively reproduce these effects when specifically instructed to do so (LLM-WM) or after finetuning (Centaur). Right: Average accuracy for each set size. Online participants show a clear set-size effect, with performance decreasing as the growing set size increases memory load. Again, LLMs qualitatively reproduce this effect when specifically instructed to do so (LLM-WM) or finetuned (Centaur), but not when merely instructed to behave human-like (LLM-Human).
  • Figure 3: Working memory profile comparison and anomaly detection. We use a hierarchical Bayesian logistic regression model to capture the probability of a correct response in the probed recall task based on established working memory phenomena. Left: Jointly fitting this model to both online and LLM participants, we find strong fixed effects for recency, primacy, and capacity. Whiskers denote the 94% highest density interval (HDI). Center: The joint model reveals that participant-level working memory profiles, captured by their random effects, differ notably between participant types. Right: We perform anomaly detection by fitting a second model on a subset of online participants and scoring held-out online participants and LLM participants with that model using log predictive densities. The resulting ROC curve reveals that LLM participants (negative class) can be separated from online participants (positive class) even when designed to mimic working memory.
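To make the accuracy curves in Figure 2 concrete, the sketch below computes accuracy by serial position and by set size from per-trial records. The record format used here (keys `set_size`, `probe_position`, `correct`) is a hypothetical illustration, not the authors' actual data schema.

```python
# Sketch: aggregating probed-recall trials into the serial-position and
# set-size accuracy curves of Figure 2. The trial dictionary keys are a
# hypothetical format, not the paper's actual data schema.
from collections import defaultdict
from statistics import mean

def accuracy_curves(trials):
    """Return (accuracy by serial position, accuracy by set size)."""
    by_position = defaultdict(list)
    by_set_size = defaultdict(list)
    for t in trials:
        by_position[t["probe_position"]].append(t["correct"])
        by_set_size[t["set_size"]].append(t["correct"])
    return ({pos: mean(v) for pos, v in by_position.items()},
            {n: mean(v) for n, v in by_set_size.items()})

# Usage on three toy trials:
pos_acc, size_acc = accuracy_curves([
    {"set_size": 12, "probe_position": 0, "correct": 1},
    {"set_size": 12, "probe_position": 0, "correct": 0},
    {"set_size": 6, "probe_position": 5, "correct": 1},
])
```

A primacy effect would then show up as higher values of `pos_acc` at the earliest positions, and a set-size effect as `size_acc` decreasing in the set size.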
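The anomaly-detection step in Figure 3 can be sketched in two parts: scoring each held-out participant by the log predictive density of their responses under a reference model fit to known-human data, then computing an ROC curve separating humans (positive class) from LLMs (negative class). The constant-accuracy Bernoulli model below is a deliberately simplified stand-in for the paper's hierarchical Bayesian logistic regression; it only illustrates the scoring and ROC machinery.

```python
# Sketch of the scoring and ROC mechanics behind Figure 3 (right).
# A fixed-accuracy Bernoulli model stands in for the paper's
# hierarchical Bayesian logistic regression.
import math

def log_predictive_density(responses, p_correct):
    """Log-likelihood of binary responses under a fixed accuracy p_correct."""
    eps = 1e-9
    p = min(max(p_correct, eps), 1 - eps)
    return sum(math.log(p) if r else math.log(1 - p) for r in responses)

def roc_auc(pos_scores, neg_scores):
    """AUC via the rank-based (Mann-Whitney U) formulation: the
    probability that a random positive outscores a random negative."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

Note that in the paper the separation comes from the hierarchical model's recency, primacy, and capacity terms, under which characteristically human response patterns receive higher predictive density; a constant-accuracy model like the stand-in above would not by itself separate the two groups.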