CS4: Measuring the Creativity of Large Language Models Automatically by Controlling the Number of Story-Writing Constraints

Anirudh Atmakuru, Jatin Nainani, Rohith Siddhartha Reddy Bheemreddy, Anirudh Lakkaraju, Zonghai Yao, Hamed Zamani, Haw-Shiuan Chang

TL;DR

The experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence.

Abstract

Evaluating the creativity of large language models (LLMs) in story writing is difficult because LLM-generated stories can appear creative while closely resembling existing stories in the models' huge, proprietary training corpora. To overcome this challenge, we introduce a novel benchmark dataset with varying levels of prompt specificity: CS4 ($\mathbf{C}$omparing the $\mathbf{S}$kill of $\mathbf{C}$reating $\mathbf{S}$tories by $\mathbf{C}$ontrolling the $\mathbf{S}$ynthesized $\mathbf{C}$onstraint $\mathbf{S}$pecificity). By increasing the number of requirements/constraints in the prompt, we can increase the prompt specificity and hinder LLMs from retelling high-quality narratives in their training data. Consequently, CS4 empowers us to indirectly measure the LLMs' creativity without human annotations. Our experiments on LLaMA, Gemma, and Mistral not only highlight the creativity challenges LLMs face when dealing with highly specific prompts but also reveal that different LLMs perform very differently under different numbers of constraints and achieve different balances between the model's instruction-following ability and narrative coherence. Additionally, our experiments on OLMo suggest that Learning from Human Feedback (LHF) can help LLMs select better stories from their training data but has limited influence in boosting LLMs' ability to produce creative stories that are unseen in the training corpora. The benchmark is released at https://github.com/anirudhlakkaraju/cs4_benchmark.

Paper Structure (30 sections, 1 equation, 22 figures, 3 tables)

Figures (22)

  • Figure 1: Comparison between CS4 and existing benchmarks. (a) Depiction of training corpora subsets for different narrative themes, illustrating the decreasing availability of training examples for LLMs as prompt specificity increases. (b) In response to general instructions, LLM1 tends to copy the relevant high-quality stories from its training corpus to achieve a good score in existing story-writing benchmarks. CS4 measures LLMs’ creativity by comparing LLMs’ performance drops for more specific instructions. Given more constraints, LLM2 could leverage very limited training data to output higher-quality stories than LLM1, so LLM2 is more creative.
  • Figure 2: An overview of the evaluation process in the CS4 benchmark. First, we use two different few-shot in-context learning approaches to synthesize 39 constraints from every user instruction and conduct sub-sampling to create the prompts with fewer constraints. Next, for each user instruction, the testing LLMs of interest revise a base story, which is generated without seeing the constraints, to satisfy the constraints. The revised stories are evaluated in terms of their constraint satisfaction ratio, quality, and diversity. Finally, we estimate LLMs' creativity by summarizing their coherence scores and instruction satisfaction ratios for different numbers of constraints.
  • Figure 3: Analyzing the trade-off between coherence and constraint satisfaction.
  • Figure 4: Generation diversity for stories written using story-based constraints.
  • Figure 5: General framework of the system prompt used to generate instruction-based constraints.
  • ...and 17 more figures
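
The sub-sampling step and the constraint satisfaction ratio described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the subset sizes (7, 15, 23, 31, 39), the fixed seed, and the choice to make the smaller subsets nested inside the larger ones are all assumptions made for illustration. Only the total of 39 constraints per instruction comes from the caption.

```python
import random


def subsample_constraints(constraints, sizes, seed=0):
    """Return {size: subset} for each requested size.

    Assumption of this sketch: subsets are nested, i.e. every smaller
    subset is a prefix of one shuffled ordering of the full list.
    """
    rng = random.Random(seed)
    order = list(constraints)
    rng.shuffle(order)
    return {k: order[:k] for k in sorted(sizes) if k <= len(order)}


def satisfaction_ratio(satisfied_flags):
    """Fraction of constraints judged satisfied by a revised story."""
    flags = list(satisfied_flags)
    if not flags:
        raise ValueError("at least one constraint is required")
    return sum(bool(flag) for flag in flags) / len(flags)


# Placeholder constraints; real ones would be synthesized by an LLM.
constraints = [f"constraint {i}" for i in range(1, 40)]  # 39 in total

# Hypothetical subset sizes, chosen only for illustration.
subsets = subsample_constraints(constraints, sizes=(7, 15, 23, 31, 39))

for size, subset in subsets.items():
    print(size, len(subset))
```

Nesting the subsets means that any change in a model's score between two prompt sizes reflects the added constraints rather than a different random draw; whether the benchmark samples this way is not stated in the caption.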