Benchmarking Language Model Creativity: A Case Study on Code Generation

Yining Lu; Dixuan Wang; Tianjian Li; Dongwei Jiang; Sanjeev Khudanpur; Meng Jiang; Daniel Khashabi

TL;DR

The paper presents Denial Prompting to elicit creative code-generation strategies and NeoGauge to quantify both convergent (correct, constraint-adhering) and divergent (historically novel) creativity in LLMs, using the NeoCoder dataset built from Codeforces problems. Through extensive benchmarking (e.g., GPT-4 vs. other LLMs) and evaluation of reasoning strategies, GPT-4 shows the strongest creativity but remains far from human-level creativity, and most augmentation strategies fail to improve divergent creativity. The approach combines state-aware assessment with human-grounded baselines to provide a comprehensive, interpretable measure of machine creativity in problem solving, and it releases NeoCoder to enable future benchmarking. Overall, the framework reveals a persistent gap between current LLM creativity and human performance in constrained problem-solving tasks, while offering a scalable, extensible method for ongoing evaluation as models evolve.

Abstract

As LLMs become increasingly prevalent, it is interesting to consider how ``creative'' these models can be. In cognitive science, creativity is understood to consist of at least two key characteristics: \emph{convergent} thinking (purposefulness to achieve a given goal) and \emph{divergent} thinking (adaptability to explore new environments or constraints) \citep{runco2003critical}. In this work, we introduce a framework for quantifying LLM creativity that incorporates two design ingredients: (1) We introduce DENIAL PROMPTING, which pushes LLMs to develop more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling LLMs to adopt new strategies. (2) We define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the creative responses generated by LLMs. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions. We quantify NEOGAUGE for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release the NEOCODER dataset for reproducing our results on future models.
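The incremental-constraint loop described in item (1) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`denial_prompting`, `generate`, `detect_techniques`), the choice to deny exactly one technique per state, and the toy stand-ins for the model and the technique detector are all hypothetical.

```python
from typing import Callable, List, Set, Tuple


def denial_prompting(
    problem: str,
    generate: Callable[[str, Set[str]], str],       # hypothetical model call
    detect_techniques: Callable[[str], Set[str]],   # hypothetical technique detector
    max_states: int = 3,
) -> List[Tuple[Set[str], str]]:
    """Return one (constraints, solution) pair per state."""
    banned: Set[str] = set()
    trajectory: List[Tuple[Set[str], str]] = []
    for _ in range(max_states):
        solution = generate(problem, banned)
        trajectory.append((set(banned), solution))
        # Techniques in the previous solution that are not yet denied.
        used = detect_techniques(solution) - banned
        if not used:
            break  # nothing new to deny
        # Assumption: deny one technique per state (picked deterministically).
        banned.add(sorted(used)[0])
    return trajectory


# Toy stand-ins so the sketch runs without a real model.
TECHNIQUES = ("for-loop", "while-loop", "recursion")


def toy_generate(problem: str, banned: Set[str]) -> str:
    for technique in TECHNIQUES:
        if technique not in banned:
            return f"solve {problem} using {technique}"
    return f"solve {problem} using nothing"


def toy_detect(solution: str) -> Set[str]:
    return {t for t in TECHNIQUES if t in solution}


if __name__ == "__main__":
    for constraints, solution in denial_prompting("sum", toy_generate, toy_detect):
        print(sorted(constraints), "->", solution)
```

Each state records the constraints in force when the solution was produced, so a later evaluation can check constraint following per state.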

Paper Structure (42 sections, 7 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Background and Related Works
Machine Creativity Evaluation.
Divergent Creative Thinking.
Convergent Creative Thinking.
Constructing the NeoCoder Dataset
Denial Prompting: Eliciting Creative Generations from LLMs
NeoCoder Dataset to Support Benchmarking LLM Creativity
Challenging problems.
Human solutions.
Human annotated test examples.
Augmentation with Denial Prompting.
Statistics for NeoCoder.
State-Aware and Human-Grounded Evaluation of Machine Creativity
Augmentation model vs target model.
...and 27 more sections

Figures (9)

Figure 1: An overview of how Denial Prompting encourages creative solutions. A solution space is a collection of all possible solutions at a certain state. A, B indicate atomic techniques (e.g., for-loops, if-else, etc.) used in the solution.
Figure 2: Proportion of the top 5 most common atomic techniques used by GPT-4 per state, compared to those in human solutions. In the absence of any constraints (the first column), the model defaults to common and accessible techniques, like humans (the last column). This echoes our claim in §\ref{sec: introduction} that eliciting creative solutions is crucial for creativity evaluation.
Figure 3: Example of NeoGauge computation. The question comes from our NeoCoder dataset (ID https://codeforces.com/problemset/problem/1829/B), and the tested model $\mathbf{G}_\text{LM}$ here is GPT-4. For each state, we compute NeoGauge (Eq.\ref{eq: creativity}) as the probability that the LM generates correct solutions that meet the given constraints (convergent creativity, defined in Eq.\ref{eq: convergent}) and also exhibit $\mathcal{H}$-creativity (divergent creativity, defined in Eq.\ref{eq: divergent}). None of the three solutions shown is considered "creative": convergent solutions may lack divergent creativity (e.g., state $t=0$), while hallucinated LLM responses can show high $\mathcal{H}$-creativity but often lack correctness and constraint following (e.g., state $t=1$). Therefore, truly creative works should not only be innovative but also appropriately solve a problem.
Figure 4: NeoGauge (left) and cumulative NeoGauge (right) across states.
Figure 5: A comparison of LLM and human creativity. //// denotes the performance difference in convergent creativity, and \\\\ denotes the difference in divergent creativity. We observe that current LLMs still hardly demonstrate human-like creativity.
...and 4 more figures
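The idea in the Figure 3 caption, that a solution counts toward NeoGauge only when it is both convergent and divergent, can be illustrated with a small empirical estimate. This is a sketch under assumptions rather than the paper's equations, which are not reproduced on this page: the `Sample` record, the function names, and the binary novelty test (a solution is treated as novel if it uses at least one technique absent from the human solutions) are all hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Set


@dataclass(frozen=True)
class Sample:
    """One generated solution at some state (hypothetical record)."""
    correct: bool              # passes the test examples
    follows_constraints: bool  # avoids every denied technique
    techniques: FrozenSet[str]


def is_convergent(sample: Sample) -> bool:
    # Convergent: correct and constraint-following.
    return sample.correct and sample.follows_constraints


def is_divergent(sample: Sample, human_techniques: Set[str]) -> bool:
    # Assumption: novelty means using a technique no human solution used.
    return bool(sample.techniques - human_techniques)


def neogauge_estimate(samples: List[Sample], human_techniques: Set[str]) -> float:
    """Fraction of samples that are both convergent and divergent."""
    if not samples:
        return 0.0
    hits = sum(
        1 for s in samples
        if is_convergent(s) and is_divergent(s, human_techniques)
    )
    return hits / len(samples)


human = {"for-loop", "if-else"}
samples = [
    # Correct but conventional: convergent, not divergent.
    Sample(True, True, frozenset({"for-loop"})),
    # Unusual but wrong: divergent, not convergent.
    Sample(False, False, frozenset({"bit-manipulation"})),
    # Correct, constraint-following, and unusual: counts as creative.
    Sample(True, True, frozenset({"recursion"})),
    # Correct but violates a constraint: not convergent.
    Sample(True, False, frozenset({"recursion"})),
]

if __name__ == "__main__":
    print(neogauge_estimate(samples, human))
```

Requiring both conditions at once is what keeps a hallucinated but unusual answer, or a correct but conventional one, from being scored as creative.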
