Benchmarking Language Model Creativity: A Case Study on Code Generation
Yining Lu, Dixuan Wang, Tianjian Li, Dongwei Jiang, Sanjeev Khudanpur, Meng Jiang, Daniel Khashabi
TL;DR
The paper presents Denial Prompting, which elicits creative code-generation strategies, and NeoGauge, which quantifies both convergent (correct, constraint-adhering) and divergent (historically novel) creativity in LLMs, using the NeoCoder dataset built from Codeforces problems. In benchmarks of proprietary and open-source LLMs and of several reasoning strategies, GPT-4 is the most creative model tested but remains far from human-level creativity, and most augmentation strategies fail to improve divergent creativity. The approach combines state-aware assessment with human-grounded baselines to give a comprehensive, interpretable measure of machine creativity in problem solving, and the authors release NeoCoder to enable future benchmarking. Overall, the framework reveals a persistent gap between current LLM creativity and human performance on constrained problem-solving tasks, while offering a scalable, extensible method for evaluating models as they evolve.
Abstract
As LLMs become increasingly prevalent, it is interesting to consider how ``creative'' these models can be. According to cognitive science, creativity comprises at least two key characteristics: \emph{convergent} thinking (purposefulness to achieve a given goal) and \emph{divergent} thinking (adaptability to explore new environments or constraints) \citep{runco2003critical}. In this work, we introduce a framework for quantifying LLM creativity that incorporates two design ingredients: (1) we introduce DENIAL PROMPTING, which pushes LLMs to develop more creative solutions to a given problem by incrementally imposing new constraints on the previous solution, compelling LLMs to adopt new strategies; (2) we define NEOGAUGE, a metric that quantifies both convergent and divergent thinking in the creative responses generated by LLMs. We test the proposed framework on Codeforces problems, which serve as both a natural dataset for coding tasks and a collection of prior human solutions. We quantify NEOGAUGE for various proprietary and open-source models and find that even the most creative model, GPT-4, still falls short of demonstrating human-like creativity. We also experiment with advanced reasoning strategies (MCTS, self-correction, etc.) and observe no significant improvement in creativity. As a by-product of our analysis, we release the NEOCODER dataset for reproducing our results on future models.
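
The incremental-constraint loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`denial_prompting`, `generate`, `detect_techniques`), the choice to deny exactly one technique per round, and the toy stand-ins for the model and the technique detector are all assumptions made for illustration.

```python
from typing import Callable, List, Tuple


def denial_prompting(
    problem: str,
    generate: Callable[[str, List[str]], str],
    detect_techniques: Callable[[str], List[str]],
    rounds: int = 3,
) -> List[Tuple[List[str], str]]:
    """Return (constraints in force, solution) for each round.

    `generate` and `detect_techniques` are hypothetical callables standing in
    for an LLM call and a technique detector; neither comes from the paper.
    """
    banned: List[str] = []
    history: List[Tuple[List[str], str]] = []
    for _ in range(rounds):
        solution = generate(problem, banned)
        history.append((list(banned), solution))
        # Assumption: deny one technique from the previous solution per round.
        new = [t for t in detect_techniques(solution) if t not in banned]
        if not new:
            break  # nothing left to deny
        banned.append(new[0])
    return history


# Toy stand-ins so the sketch runs without a model (purely illustrative).
TECHNIQUES = ["for loop", "while loop", "recursion"]


def toy_generate(problem: str, banned: List[str]) -> str:
    allowed = [t for t in TECHNIQUES if t not in banned]
    return f"solve with {allowed[0]}" if allowed else "no solution"


def toy_detect(solution: str) -> List[str]:
    return [t for t in TECHNIQUES if t in solution]


history = denial_prompting("sum a list", toy_generate, toy_detect, rounds=3)
```

Each entry in `history` pairs the constraints in force with the solution produced under them, so later rounds show how the solver is pushed away from the techniques it used earlier.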
