How do Humans and Language Models Reason About Creativity? A Comparative Analysis

Antonio Laverghetta, Tuhin Chakrabarty, Tom Hope, Jimmy Pronchick, Krupa Bhawsar, Roger E. Beaty

TL;DR

This work probes how humans and large language models (LLMs) reason about creativity in STEM tasks, using a Design Problems Task and a fine-grained set of originality facets: remoteness, uncommonness, and cleverness. In Study 1, human experts rated solutions with or without exemplar ratings, revealing that providing examples shifts emphasis away from remoteness and uncommonness toward cleverness, with corresponding changes in linguistic markers and no significant gains in originality-score accuracy. Study 2 replicates the evaluation with LLMs (claude-3.5-haiku and gpt-4o-mini), finding that exemplars boost alignment with true originality but also cause near-homogenization of facet correlations, making the models highly predictive yet less facet-distinct. The results underscore important differences between human and AI creativity evaluation processes and caution against relying solely on LLMs for nuanced, facet-specific creativity judgments in STEM, while suggesting avenues to refine AI evaluators for interpretability and fairness.

Abstract

Creativity assessment in science and engineering is increasingly based on both human and AI judgment, but the cognitive processes and biases behind these evaluations remain poorly understood. We conducted two experiments examining how including example solutions with ratings impacts creativity evaluation, using a fine-grained annotation protocol in which raters were tasked with explaining their originality scores and rating the facets of remoteness (whether the response is "far" from everyday ideas), uncommonness (whether the response is rare), and cleverness. In Study 1, we analyzed creativity ratings from 72 experts with formal science or engineering training, comparing those who received example solutions with ratings (example) to those who did not (no example). Computational text analysis revealed that, compared to experts with examples, no-example experts used more comparative language (e.g., "better/worse") and emphasized solution uncommonness, suggesting they may have relied more on memory retrieval for comparisons. In Study 2, parallel analyses with state-of-the-art LLMs revealed that models prioritized uncommonness and remoteness of ideas when rating originality, suggesting an evaluative process rooted in the semantic similarity of ideas. In the example condition, while LLM accuracy in predicting the true originality scores improved, the correlations of remoteness, uncommonness, and cleverness with originality also increased substantially, to upwards of $0.99$, suggesting a homogenization in the LLMs' evaluation of the individual facets. These findings highlight important implications for how humans and AI reason about creativity and suggest diverging preferences for what different populations prioritize when rating.

Paper Structure

This paper contains 15 sections, 15 figures, 1 table.

Figures (15)

  • Figure 1: Pearson correlations among pairwise Likert ratings for both conditions. o = originality, c = cleverness, u = uncommonness, r = remoteness.
  • Figure 2: Pearson correlations among pairwise Likert ratings for gpt-4o-mini in both conditions. o = originality, c = cleverness, u = uncommonness, r = remoteness.
  • Figure 3: Human and gpt-4o-mini originality scores.
  • Figure 4: Comparison between linguistic marker use from humans (top) and gpt-4o-mini (bottom), as assessed by gpt-4o. A rating of 1 indicates the feature is absent from the response; 2 indicates it is present.
  • Figure 5: Past/future language prompt. [RESPONSE] is filled with the participant explanation.
  • ...and 10 more figures
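
Figures 1 and 2 report Pearson correlations among pairwise Likert ratings for originality (o), cleverness (c), uncommonness (u), and remoteness (r). The following is a minimal sketch of how such a pairwise table could be computed. The ratings are made up for illustration and are not the paper's data, and the function names `pearson` and `pairwise_correlations` are hypothetical rather than taken from the authors' code.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length rating lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(ratings):
    """Map each unordered pair of facet labels to its Pearson correlation."""
    return {
        (a, b): pearson(ratings[a], ratings[b])
        for a, b in combinations(ratings, 2)
    }


# Illustrative 1-5 Likert ratings for five responses (invented values).
ratings = {
    "o": [1, 2, 3, 4, 5],  # originality
    "c": [2, 1, 4, 3, 5],  # cleverness
    "u": [1, 2, 3, 4, 5],  # uncommonness
    "r": [5, 4, 3, 2, 1],  # remoteness
}

corr = pairwise_correlations(ratings)
for (a, b), value in corr.items():
    print(f"{a}-{b}: {value:+.2f}")
```

In this toy data the uncommonness ratings match originality exactly, so their correlation is 1, which mirrors in miniature the kind of facet homogenization the abstract describes for LLMs in the example condition.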