CREATE: Testing LLMs for Associative Creativity

Manya Wadhwa, Tiasa Singha Roy, Harvey Lederman, Junyi Jessy Li, Greg Durrett

TL;DR

Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve.

Abstract

A key component of creativity is associative reasoning: the ability to draw novel yet meaningful connections between concepts. We introduce CREATE, a benchmark designed to evaluate models' capacity for creative associative reasoning. CREATE requires models to generate sets of paths connecting concepts in a model's parametric knowledge. Paths should have high specificity (distinctiveness and closeness of the concept connection) and high diversity (dissimilarity from other paths), and models are scored more highly if they produce a larger set of strong, diverse paths. This task shares demands of real creativity tasks like hypothesis generation, including an extremely large search space, but enables collection of a sizable benchmark with objective answer grading. Evaluation of frontier models shows that the strongest models achieve higher creative utility than others, with the high multiplicity of answers and complexity of the search making benchmark saturation difficult to achieve. Furthermore, our results illustrate that thinking models are not always more effective on our task, even with high token budgets. Recent approaches for creative prompting give some but limited additional improvement. CREATE provides a sandbox for developing new methods to improve models' capacity for associative creativity.

Paper Structure (39 sections, 9 equations, 9 figures, 18 tables, 1 algorithm)

Figures (9)

  • Figure 1: Motivating example of brainstorming paths in knowledge graphs. In CREATE, only the question is given; reasoning over the graph is implicit in the model's parameters and thinking trace, similar to drawing connections for scientific research. Finding strong, distinct paths can be challenging.
  • Figure 2: Examples of model-generated paths $u$ compared against population paths, along with quality scores and minimum distance values. The first and last connect artists through classic relations of directing, acting, performing, etc. The second path is the weakest according to the assessed specificity, because a connection through St. Louis is potentially shared by many entities.
  • Figure 3: Alternative prompting methods can lead to improvements depending on the model. Iterate and Resample interventions lead to the highest creative utility scores.
  • Figure 4: Creative utility vs. patience for the frontier models, as well as prompt variations for GPT-5-mini. Utility values are similar at lower patience, but the differences grow as patience increases.
  • Figure 5: How the creative utility (patience=0.9) of a system changes when factuality is included in the objective. Models trade off factuality for utility. Under the most lenient setting, Gemini-3-Pro has the highest utility; under the strictest, GPT-5 balances the two metrics better than other models.
  • ...and 4 more figures