Generative AI and Creativity: A Systematic Literature Review and Meta-Analysis

Niklas Holzner, Sebastian Maier, Stefan Feuerriegel

TL;DR

This study asks whether Generative AI (GenAI) can match or enhance human creativity, and how collaboration between humans and GenAI affects creativity and idea diversity. Using a PRISMA-based systematic review and random-effects meta-analysis of 28 studies (127 effect sizes, 8214 participants), the authors examine three comparisons: GenAI alone versus humans, humans with GenAI versus humans alone, and the diversity of ideas produced with GenAI assistance. They report three core findings: GenAI's standalone creativity is on par with human performance ($g \approx -0.05$, not statistically significant), human-GenAI collaboration yields a modest creativity boost ($g \approx 0.27$), and collaboration markedly reduces idea diversity ($g \approx -0.86$). The effects are heterogeneous and vary with GenAI model, task type, and participant background. The results suggest GenAI is best treated as an augmentation tool rather than a replacement for human creativity, and they highlight design and domain considerations to mitigate diversity losses in practical applications.

Abstract

Generative artificial intelligence (GenAI) is increasingly used to support a wide range of human tasks, yet empirical evidence on its effect on creativity remains scattered. Can GenAI generate ideas that are creative? To what extent can it support humans in generating ideas that are both creative and diverse? In this study, we conduct a meta-analysis to evaluate the effect of GenAI on performance in creative tasks. For this, we first perform a systematic literature search, based on which we identify n = 28 relevant studies (m = 8214 participants) for inclusion in our meta-analysis. We then compute standardized effect sizes based on Hedges' g. We compare different outcomes: (i) how creative GenAI is; (ii) how creative humans augmented by GenAI are; and (iii) the diversity of ideas by humans augmented by GenAI. Our results show no significant difference in creative performance between GenAI and humans (g = -0.05), while humans collaborating with GenAI significantly outperform those working without assistance (g = 0.27). However, GenAI has a significant negative effect on the diversity of ideas for such collaborations between humans and GenAI (g = -0.86). We further analyze heterogeneity across different GenAI models (e.g., GPT-3.5, GPT-4), different tasks (e.g., creative writing, ideation, divergent thinking), and different participant populations (e.g., laypeople, business, academia). Overall, our results position GenAI as an augmentative tool that can support, rather than replace, human creativity, particularly in tasks benefiting from ideation support.
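The abstract reports all results as Hedges' g. As a minimal sketch of what that statistic measures, the function below computes g from group-level summary statistics using the standard textbook formula: Cohen's d with a pooled standard deviation, multiplied by a small-sample correction factor. The function name `hedges_g` and the example numbers are hypothetical, and the sketch assumes two independent groups; it is not taken from the authors' code.

```python
import math


def hedges_g(mean_t, sd_t, n_t, mean_c, sd_c, n_c):
    """Return (g, var_g) for a treatment group vs. a control group.

    Assumes two independent groups summarized by mean, SD, and size.
    Positive g means the treatment group scored higher than control.
    """
    df = n_t + n_c - 2
    # Pooled standard deviation across both groups.
    pooled_sd = math.sqrt(((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / df)
    # Cohen's d: standardized mean difference.
    d = (mean_t - mean_c) / pooled_sd
    # Small-sample correction factor (approximation commonly used for g).
    j = 1 - 3 / (4 * df - 1)
    g = j * d
    # Approximate sampling variance of d, then of g.
    var_d = (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))
    var_g = j ** 2 * var_d
    return g, var_g


if __name__ == "__main__":
    # Made-up illustrative numbers, not data from the paper.
    g, var_g = hedges_g(mean_t=5.5, sd_t=1.0, n_t=50, mean_c=5.0, sd_c=1.0, n_c=50)
    print(f"g = {g:.3f}, SE = {math.sqrt(var_g):.3f}")
```

Under the convention stated in the figure captions (treatment: GenAI or human-GenAI collaboration; control: human alone), a negative g favors the human-only control and a positive g favors the GenAI condition.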

Paper Structure

This paper contains 22 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: PRISMA flow chart.
  • Figure 2: Pooled effect comparing the creative performance of humans vs. GenAI (RQ1). The forest plot summarizes the Hedges' $g$ effect sizes and 95% confidence intervals for a direct comparison between humans and GenAI (treatment: GenAI vs. control: human alone). Out of the 127 observations, 100 observations (participants $m = 4582$) compare differences in creative performance between humans and GenAI, and are thus included in the comparison. Each line is one estimate (the weight is shown at the right). The overall effect size of $g = -0.048$ indicates no statistically significant difference. The vertical line at $g = 0$ corresponds to a null effect; observations to the left favor the human control, and observations to the right favor GenAI. The bars are the estimated effect sizes, and the whiskers are the 95% CIs. The orange dashed line is the mean pooled effect size and the orange shaded area is its 95% CI.
  • Figure 3: Heterogeneity analysis for RQ1 (creative performance of humans vs. GenAI). Violin plots show the distribution of observation-level Hedges' $g$ for the comparison in creative performance between human and GenAI conditions. The comparison is stratified by (a) GenAI model, (b) participant background, and (c) task type. Subgroup analyses are reported only for categories with a sufficient number of observations to support meaningful comparisons. The widths reflect the density of effect sizes; the dashed line corresponds to $g=-0.048$ with no overall difference.
  • Figure 4: Pooled effect of the benefit from Human-GenAI collaboration on creative performance (RQ2a). The forest plot summarizes the Hedges' $g$ effect sizes and 95% confidence intervals (treatment: human-GenAI collaboration versus control: human alone). Out of the 127 observations, $n = 21$ observations (participants $m = 2798$) quantify differences in creative performance between humans and human-GenAI collaboration. Each line is one estimate (the weight is shown at the right). The overall effect size of $g = 0.273$ indicates a modest performance gain from GenAI assistance. The vertical line at $g = 0$ corresponds to a null effect; points to the right favor the GenAI-assisted collaboration. The bars are the estimated effect sizes, and the whiskers are the 95% CIs. The orange dashed line is the mean pooled effect size and the orange shaded area is its 95% CI.
  • Figure 5: Heterogeneity analysis for RQ2a (creative performance of humans+GenAI vs. humans only). Violin plots show the distribution of observation-level Hedges' $g$ for the benefit in creative performance of human-GenAI collaboration over a human-only condition. The comparison is stratified by (a) GenAI model, (b) participant background, and (c) task type. Subgroup analyses are reported only for categories with a sufficient number of observations to support meaningful comparisons. The widths reflect the density of effect sizes; the dashed line corresponds to $g=0$ with no overall difference.
  • ...and 1 more figures
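
The forest plots above each report a pooled effect size with a 95% CI from a random-effects model. The sketch below shows one common way such a pooled estimate can be computed, using the DerSimonian-Laird estimator for the between-study variance. This estimator is an assumption made for illustration: the text here does not say which estimator the authors use, and the sketch treats all effect sizes as independent, which ignores the nesting of 127 effect sizes within 28 studies. The function name `random_effects_pool` is hypothetical.

```python
import math


def random_effects_pool(effects, variances):
    """Pool effect sizes with a DerSimonian-Laird random-effects model.

    effects:   list of effect sizes (e.g., Hedges' g per observation)
    variances: list of their sampling variances (same length)
    Returns a dict with the pooled effect, its 95% CI, tau^2, and I^2.
    """
    k = len(effects)
    # Fixed-effect (inverse-variance) weights and pooled estimate.
    w = [1.0 / v for v in variances]
    sum_w = sum(w)
    fixed = sum(wi * gi for wi, gi in zip(w, effects)) / sum_w
    # Cochran's Q: weighted squared deviations from the fixed-effect mean.
    q = sum(wi * (gi - fixed) ** 2 for wi, gi in zip(w, effects))
    # DerSimonian-Laird estimate of between-study variance tau^2.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each sampling variance.
    w_star = [1.0 / (v + tau2) for v in variances]
    sum_w_star = sum(w_star)
    pooled = sum(wi * gi for wi, gi in zip(w_star, effects)) / sum_w_star
    se = math.sqrt(1.0 / sum_w_star)
    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - (k - 1)) / q) if q > 0 else 0.0
    return {
        "pooled": pooled,
        "ci_low": pooled - 1.96 * se,
        "ci_high": pooled + 1.96 * se,
        "tau2": tau2,
        "i2": i2,
    }


if __name__ == "__main__":
    # Made-up illustrative inputs, not values from the paper.
    result = random_effects_pool([0.10, 0.35, 0.42, -0.05], [0.02, 0.03, 0.05, 0.04])
    print(result)
```

When the observations disagree more than their sampling variances can explain, tau^2 grows, the weights become more equal, and the confidence interval widens, which is why heterogeneous literatures tend to produce wide shaded bands in forest plots.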