Evaluating Cognitive Maps and Planning in Large Language Models with CogEval

Ida Momennejad, Hosein Hasanbeig, Felipe Vieira, Hiteshi Sharma, Robert Osazuwa Ness, Nebojsa Jojic, Hamid Palangi, Jonathan Larson

TL;DR

The paper introduces CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive abilities in large language models. It applies CogEval to assess cognitive maps and planning across eight LLMs using prompts grounded in human experiments and designed to avoid training-data contamination. The findings show substantial variability across models and task configurations, with no evidence for emergent out-of-the-box cognitive-map or planning abilities; common failure modes include edge hallucinations and loops in dense graphs. The work proposes future directions such as representation analysis and memory/planning augmentations to enhance LLMs' adaptive planning capabilities.

Abstract

Recently, an influx of studies has claimed emergent cognitive abilities in large language models (LLMs). Yet most rely on anecdotes, overlook contamination of training sets, or lack systematic evaluation involving multiple tasks, control conditions, multiple iterations, and statistical robustness tests. Here we make two major contributions. First, we propose CogEval, a cognitive science-inspired protocol for the systematic evaluation of cognitive capacities in LLMs. The CogEval protocol can be followed to evaluate various abilities. Second, we follow CogEval to systematically evaluate cognitive maps and planning ability across eight LLMs (OpenAI GPT-4, GPT-3.5-turbo-175B, davinci-003-175B, Google Bard, Cohere-xlarge-52.4B, Anthropic Claude-1-52B, LLaMA-13B, and Alpaca-7B). We base our task prompts on human experiments, which both offer established construct validity for evaluating planning and are absent from LLM training sets. We find that, while LLMs show apparent competence in a few planning tasks with simpler structures, systematic evaluation reveals striking failure modes in planning tasks, including hallucinations of invalid trajectories and getting trapped in loops. These findings do not support the idea of emergent out-of-the-box planning ability in LLMs. This could be because LLMs do not understand the latent relational structures underlying planning problems, known as cognitive maps, and fail at unrolling goal-directed trajectories based on the underlying structure. Implications for application and future directions are discussed.

Paper Structure (14 sections, 4 figures, 3 tables)

This paper contains 14 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The CogEval protocol, Experiment 1 task structure, and example task prompt. (top) In the CogEval protocol, a latent ability is evaluated by first operationalizing it as tasks and then measuring it multiple times, with variations and controls. We followed this protocol to evaluate cognitive maps and planning. To robustly evaluate these abilities, multiple task prompts were generated with varying task structures (graphs), item domains (e.g., spatial or social), and task conditions (e.g., value-based path, detour). LLM responses were generated 30 times per task prompt and temperature for the three OpenAI models studied in this work, and once per task and temperature for the other LLMs. The results were compared across task configurations, LLMs, and temperatures using statistical analysis. (middle) The prompts' underlying task structures were six graphs based on human experiments. A: simple line graph from Momennejad2017-wr. B: simple tree graphs based on Momennejad2018-zd. C: graph A with double depth and stochastic transitions. D, E, and F represent community graphs from Schapiro2013-mx, Momennejad2019-tf, and Pudhiyidath2022-sw, respectively. (bottom) An example prompt for graph A. This procedure evaluates planning behavior in value-based navigation (see Table \ref{tab:conditions_explanation}). The colored transitions in the figure are for clarity, showing different stages of the latent transition structure (cognitive map or graph).
  • Figure 2: Experiment 1 results. (top) Mean and standard error of performance on all tasks for each of the graphs (see Figure \ref{fig:graphFig1} for graph details) across the LLMs studied in this work. (bottom) Mean performance compared across the main task categories (see Table \ref{tab:llms_comparison} for details).
  • Figure 3: Experiment 2 results. (bottom) BFS and DFS instructions marginally enhance performance on community graphs. In the cluster counting task (graph D), adding BFS or DFS instructions is beneficial at temperatures 0 and 0.5 but less so at 1. For finding shortest paths within a cluster, both BFS and DFS help, with BFS being effective at temperature 0. However, for finding the shortest path 1 cluster away, only BFS at temperature 0.5 yields slight improvements.
  • Figure 4: Examples of three failure modes. (left) Edge hallucination. (middle) Failure at finding a 1-step policy within the same cluster. (right) Failure at multi-hop path by both getting trapped in a loop and hallucinating edges. In each example the blue box is the task prompt, the grey box shows the model response, and the green arrows demonstrate the correct response on the graph.
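The failure modes named in Figure 4 (edge hallucination and getting trapped in a loop) can be checked mechanically once a model's answer has been parsed into a sequence of nodes. The following is a minimal sketch of such a check, not the paper's scoring code: the graph `EDGES`, the function names, and the returned fields are all hypothetical, and the graph is a made-up two-cluster example rather than any of graphs A to F. It assumes undirected edges and treats any repeated node as a loop.

```python
from collections import deque

# Hypothetical undirected graph: two 3-node clusters joined by the bridge 3-4.
# This is NOT one of the paper's graphs; it only illustrates the checks.
EDGES = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6)]


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbours}."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def shortest_path_length(adj, start, goal):
    """Breadth-first search; returns the number of edges, or None if unreachable."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def check_path(adj, path, start, goal):
    """Flag hallucinated edges and loops in a proposed node sequence."""
    # Steps that use an edge absent from the graph (edge hallucination).
    hallucinated = [(a, b) for a, b in zip(path, path[1:])
                    if b not in adj.get(a, set())]
    # Assumption: any revisited node counts as a loop.
    has_loop = len(set(path)) < len(path)
    endpoints_ok = bool(path) and path[0] == start and path[-1] == goal
    valid = endpoints_ok and not hallucinated
    optimal = valid and (len(path) - 1) == shortest_path_length(adj, start, goal)
    return {"hallucinated_edges": hallucinated, "has_loop": has_loop,
            "valid": valid, "optimal": optimal}


ADJ = build_adjacency(EDGES)
```

For example, `check_path(ADJ, [1, 6], 1, 6)` reports the nonexistent edge `(1, 6)`, while `check_path(ADJ, [1, 2, 1, 3, 4, 6], 1, 6)` is a valid but looping, non-optimal path. Separating "valid" from "optimal" keeps hallucinated edges distinct from merely inefficient routes.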