Assessing and Understanding Creativity in Large Language Models

Yunpu Zhao; Rui Zhang; Wenyi Li; Di Huang; Jiaming Guo; Shaohui Peng; Yifan Hao; Yuanbo Wen; Xing Hu; Zidong Du; Qi Guo; Ling Li; Yunji Chen

TL;DR

This work addresses the challenge of quantifying creativity in large language models by adapting the Torrance Tests of Creative Thinking (TTCT) into a scalable, automated framework. It builds a 700-question dataset across seven TTCT-inspired verbal tasks and evaluates six LLMs using four criteria: Fluency, Flexibility, Originality, and Elaboration, with GPT-4 as the scoring agent and human validation for reliability. The study shows substantial variation in creativity across models, prompt types, and role-play settings, with elaboration typically strong and originality variable; collaboration among models and alignment with personality traits can further influence creativity. The findings offer a practical, scalable methodology for AI creativity assessment and illuminate how model design, prompting, and psychometric factors shape creative output, bridging AI behavior with human cognitive theories and potential applications.

Abstract

In natural language processing, the rapid development of large language models (LLMs) has attracted growing attention. LLMs have shown a high level of creativity in various tasks, but methods for assessing that creativity remain inadequate. Assessing LLM creativity must account for differences from humans and requires multi-dimensional measurement that balances accuracy and efficiency. This paper aims to establish an efficient framework for assessing the level of creativity in LLMs. By adapting a modified version of the Torrance Tests of Creative Thinking, the research evaluates the creative performance of various LLMs across 7 tasks, emphasizing 4 criteria: Fluency, Flexibility, Originality, and Elaboration. To this end, we develop a comprehensive dataset of 700 questions for testing and an LLM-based evaluation method. In addition, this study presents a novel analysis of LLMs' responses to diverse prompts and role-play situations. We found that the creativity of LLMs primarily falls short in originality while excelling in elaboration. Moreover, the choice of prompt and the role-play settings of the model significantly influence creativity. The experimental results also indicate that collaboration among multiple LLMs can enhance originality. Notably, our findings reveal a consensus between human evaluations and LLMs regarding the personality traits that influence creativity. The findings underscore the significant impact of LLM design on creativity and bridge artificial intelligence and human creativity, offering insights into LLMs' creativity and potential applications.

Paper Structure (22 sections, 5 figures, 2 tables)

Introduction
Human creativity: assessment and related personality traits
Creativity assessment in psychological research
Creativity and personality: findings in psychological research
Assessing the creativity of large language models
Overview of the framework
Dataset construction
Evaluation criteria
LLM-based evaluation
Evaluation results
Results of different models and criteria
Results of different prompt types
Results of playing different roles
Results of creativity under collaboration
Investigation of the relationship between LLM's creativity and its personality traits
...and 7 more sections

Figures (5)

Figure 1: This figure depicts the overall framework used to assess creativity in this paper. First, we referred to the task settings of the TTCT to generate our dataset, and then used this dataset to evaluate various large language models. The evaluation experiments included different prompts and comparative tests where LLMs played different roles. Finally, we used GPT-4 as an evaluator to assess the results of these models.
Figure 2: a. Overall creativity scores of the seven models. The error bars represent the standard deviation, centred on the average rating. Significance markers are placed above the bars, where *** indicates $p < 0.0001$ and * indicates $p < 0.05$. Since the scoring data are paired and do not follow a normal distribution, the hypothesis test employed is the Wilcoxon signed-rank test. b. Heatmap of the win rates between the tested LLMs. The value in each cell is the win rate of the model on the vertical axis against the model on the horizontal axis. c. Scores for the contextual relevance and for the coherence and consistency of the models' answers. d. Overall creativity scores under the four criteria. The error bars represent the standard deviation, centred on the average rating. The statistical test is the same as in a.
Figure 3: The figure shows a radar chart of the performance of six models under four creativity assessment criteria across seven tasks.
Figure 4: a. Summary of the impact of different prompt types across the tasks and across the creativity criteria. A '$\checkmark$' signifies an enhancement in creativity, while a '---' indicates no significant effect. Significance markers are placed near each '$\checkmark$', where *** indicates $p < 0.0001$ and * indicates $p < 0.05$, calculated with the Wilcoxon signed-rank test. b. Creativity performance on each criterion for the different prompt types. The hypothesis tests are those reported in a. c. Creativity performance on each task for the different prompt types. The hypothesis tests are those reported in a. d. Values of each creativity metric of the LLM across all tasks when it is assigned different roles. The horizontal line indicates the level of creativity of the LLM without any role-play system prompt.
Figure 5: Scatter plots of the creativity scores under the different criteria, varying by the number of rounds and the number of agents. The area of each point represents the level of creativity.
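
The Figure 2 caption names two computations: a pairwise win rate between models and a Wilcoxon signed-rank test on paired, non-normal scores. The sketch below illustrates both in plain Python. It is not the authors' code: the function names, the input layout (one score per question, in the same question order for every model), the choice to count ties as non-wins, and the use of a normal approximation for the p-value are all assumptions made for illustration.

```python
import math


def win_rate_matrix(scores):
    """Pairwise win rates. `scores` maps a (hypothetical) model name to a
    list of per-question scores, aligned by question across models.
    Assumption: a tie does not count as a win for either model."""
    matrix = {}
    for a in scores:
        for b in scores:
            if a == b:
                continue
            pairs = list(zip(scores[a], scores[b]))
            wins = sum(1 for x, y in pairs if x > y)
            matrix[(a, b)] = wins / len(pairs)
    return matrix


def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test for paired samples.
    Zero differences are dropped, tied |differences| get average ranks,
    and the p-value uses a tie-corrected normal approximation
    (an approximation; exact tables are preferable for small samples).
    Returns (W, p) with W = min(W+, W-)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0  # no non-zero differences: nothing to test

    # Assign average ranks to the absolute differences.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)

    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return min(w_plus, w_minus), p
```

For example, `win_rate_matrix({"A": [3, 4, 5, 2], "B": [2, 4, 3, 3]})` gives model A a win rate of 0.5 against model B, since A scores strictly higher on two of the four questions. In practice a library routine such as SciPy's `scipy.stats.wilcoxon` would be the more robust choice for the test itself.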
