Can Large Language Models Simulate Human Cognition Beyond Behavioral Imitation?

Yuxuan Gu, Lunjun Liu, Xiaocheng Feng, Kun Zhu, Weihong Zhong, Lei Huang, Bing Qin

Abstract

An essential problem in artificial intelligence is whether LLMs can simulate human cognition or merely imitate surface-level behaviors. Existing datasets, however, rely on either synthetic reasoning traces or population-level aggregation, and thus fail to capture authentic individual cognitive patterns. We introduce a benchmark grounded in the longitudinal research trajectories of 217 researchers across diverse domains of artificial intelligence, in which each author's scientific publications serve as an externalized representation of their cognitive processes. To distinguish whether LLMs transfer cognitive patterns or merely imitate behaviors, our benchmark deliberately employs a cross-domain, temporal-shift generalization setting. We further propose a multidimensional cognitive alignment metric to assess individual-level cognitive consistency. Through a systematic evaluation of state-of-the-art LLMs and various enhancement techniques, we provide a first-stage empirical study of two questions: (1) How well do current LLMs simulate human cognition? and (2) How far can existing techniques enhance these capabilities?

Paper Structure

This paper contains 38 sections, 5 figures, and 13 tables.

Figures (5)

  • Figure 1: Motivating case of human simulation. The memory panel presents two representative cases from the author's prior research, each exhibiting characteristic reasoning traits: (1) an explicit awareness of shallow heuristic modeling in LLMs and a deliberate effort to avoid such shortcuts, and (2) a problem-solving style that starts from modeling the research objectives. Upper-right panel: LLMs correctly interpret the task objective but appear to rely on directly imitating specific methods from the research memory. Lower-right panel: in contrast, the authentic author maintains a consistent abstraction and reasoning trajectory across problem scenarios. This comparison illustrates the central question motivating our study: whether LLMs can transcend behavioral imitation to internalize and reproduce human-like cognitive processes.
  • Figure 2: Overview of the construction pipeline for our benchmark. It comprises two parallel tracks: Expert Annotation, which involves fine-grained verification and systematic quality filtering at each stage to ensure data reliability, and Automatic Annotation, which uses OCR and LLM-based extraction for scalable data collection. The Segmentation Protocol splits each author's data by the average publication dates of research topics, so that the test set is temporally later than the train set and captures subsequent cognitive evolution (a minimal sketch of this splitting strategy follows the figure list).
  • Figure 3: Research topics in our benchmark.
  • Figure 4: Characterization of temporal and semantic variance between train and test sets across scholars. (A) Distribution of temporal shifts, quantified as the difference in average publication year (test minus train). Almost all authors' test sets are more recent than their train sets, with the distribution peaking at about $2$ years. (B) Distribution of semantic gaps, measured by the normalized centroid L2 distance between each author's train and test set embeddings. The dashed line indicates the baseline distance between the global centroids of all train and test sets. The distribution reveals substantial within-author semantic distances, concentrated between $0.7$ and $0.9$ (a sketch of this distance computation also follows the figure list).
  • Figure 5: Overview of our evaluation framework.
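
The Segmentation Protocol in Figure 2 can be made concrete with a minimal sketch. The `Paper` record, the `split_by_topic_recency` helper, and the `test_fraction` value below are illustrative assumptions, not the paper's actual implementation; only the core idea, ranking an author's research topics by their average publication date and holding out the most recent topics as the test set, comes from the caption.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Paper:
    topic: str   # research topic this publication belongs to
    year: float  # publication date as a fractional year, e.g. 2021.5

def split_by_topic_recency(papers, test_fraction=0.3):
    """Split one author's papers into train/test sets by topic recency.

    Topics are ranked by their average publication date; the most recent
    topics form the test set, so the test set sits later in time than the
    train set (cf. Figure 4A). The test_fraction of 0.3 is an illustrative
    assumption, not the benchmark's setting.
    """
    topics = {}
    for p in papers:
        topics.setdefault(p.topic, []).append(p)
    # Rank topics by average publication date, newest first.
    ranked = sorted(topics.items(),
                    key=lambda kv: np.mean([p.year for p in kv[1]]),
                    reverse=True)
    n_test_topics = max(1, round(test_fraction * len(ranked)))
    test = [p for _, ps in ranked[:n_test_topics] for p in ps]
    train = [p for _, ps in ranked[n_test_topics:] for p in ps]
    return train, test
```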
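
Similarly, the semantic gap in Figure 4B can be sketched as follows. The row-wise L2 normalization of embeddings before averaging is our assumption about what "normalized" means in the caption; the embedding dimensionality and the random stand-in data are likewise placeholders.

```python
import numpy as np

def centroid_l2_gap(train_emb, test_emb):
    """Normalized centroid L2 distance between two sets of embeddings.

    Each row is one publication embedding. Rows are L2-normalized before
    averaging (an assumed normalization scheme), making the gap comparable
    across authors and embedding scales.
    """
    def centroid(x):
        x = x / np.linalg.norm(x, axis=1, keepdims=True)
        return x.mean(axis=0)
    return float(np.linalg.norm(centroid(train_emb) - centroid(test_emb)))

# Per-author gaps give the histogram in Figure 4B; the dashed baseline
# instead pools all train rows and all test rows across authors.
rng = np.random.default_rng(0)
train_emb = rng.normal(size=(40, 768))  # toy stand-ins for real embeddings
test_emb = rng.normal(size=(15, 768))
print(centroid_l2_gap(train_emb, test_emb))
```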