Creativity Benchmark: A benchmark for marketing creativity for large language models

Ninad Bhat, Kieran Browne, Pip Bingemann

TL;DR

Creativity Benchmark addresses the challenge of evaluating marketing creativity in large language models by grounding assessments in real-world brand briefs. It combines human-practitioner judgments (11,012 pairwise comparisons across 100 brands and three prompt types) with model-diversity analysis, LLM-as-judge experiments, and TTCT/DAT-style tests to illuminate transfer gaps and evaluation reliability. The study finds tight clustering of model performance (Δθ ≈ 0.45, head-to-head win probability ≈ 0.61), weak-to-moderate alignment between automated judges and human preferences, and partial transfer from conventional creativity tests to brand tasks. Practically, it argues for expert human evaluation, diversity-aware workflows, and ensemble strategies to maximize ideation coverage while keeping final selection in human hands; it also emphasizes testing for fit with brand voice, latency, cost, and workflow integration over chasing leaderboard superiority. The Bradley-Terry win probability $P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i}+e^{\theta_j}}$ and the top-bottom spread $\Delta\theta \approx 0.45$ are the central illustrative metrics, highlighting modest practical differences between leaders and laggards despite large-scale comparisons.

Abstract

We introduce Creativity Benchmark, an evaluation framework for large language models (LLMs) in marketing creativity. The benchmark covers 100 brands (12 categories) and three prompt types (Insights, Ideas, Wild Ideas). Human pairwise preferences from 678 practising creatives over 11,012 anonymised comparisons, analysed with Bradley-Terry models, show tightly clustered performance with no model dominating across brands or prompt types: the top-bottom spread is $\Delta\theta \approx 0.45$, which implies a head-to-head win probability of $0.61$; the highest-rated model beats the lowest only about $61\%$ of the time. We also analyse model diversity using cosine distances to capture intra- and inter-model variation and sensitivity to prompt reframing. Comparing three LLM-as-judge setups with human rankings reveals weak, inconsistent correlations and judge-specific biases, underscoring that automated judges cannot substitute for human evaluation. Conventional creativity tests also transfer only partially to brand-constrained tasks. Overall, the results highlight the need for expert human evaluation and diversity-aware workflows.
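
As a quick check on the arithmetic, the step from a Bradley-Terry spread of $\Delta\theta \approx 0.45$ to a win probability of about $0.61$ can be reproduced in a few lines. This is only an illustrative sketch: the function name `bt_win_probability` is hypothetical, and the code does not fit a model to the comparison data; it just evaluates $P(i \succ j) = e^{\theta_i}/(e^{\theta_i}+e^{\theta_j})$ for a given pair of scores.

```python
import math


def bt_win_probability(theta_i: float, theta_j: float) -> float:
    """Bradley-Terry probability that item i is preferred over item j.

    Algebraically equal to exp(theta_i) / (exp(theta_i) + exp(theta_j)),
    written as a logistic function of the score difference.
    """
    return 1.0 / (1.0 + math.exp(-(theta_i - theta_j)))


# Top-bottom spread reported in the abstract (only the difference matters,
# so the bottom model is placed at 0 for convenience).
delta_theta = 0.45
p_top_beats_bottom = bt_win_probability(delta_theta, 0.0)
print(round(p_top_beats_bottom, 2))  # 0.61
```

Because only the difference $\theta_i - \theta_j$ enters the formula, the same spread gives the same win probability wherever the two models sit on the scale.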

Paper Structure

This paper contains 105 sections, 7 equations, 34 figures, 21 tables.

Figures (34)

  • Figure 1: Overall Model Bradley-Terry scores (higher $\theta$ indicates better performance).
  • Figure 4: Intra-model diversity: average cosine distance between pairs of responses produced by the same model for the same prompt. Higher values indicate greater diversity.
  • Figure 5: Inter-model diversity: average cosine distance between responses generated by different models for the same prompt. Higher values indicate that the models explore more distinct regions of the idea space.
  • Figure 6: Average nearest-neighbour cosine distance between each model’s Ideas and Wild Ideas response sets (symmetrised; higher indicates a larger semantic shift).
  • Figure 7: Top vs. bottom brand sets: inter-model diversity by prompt type. Higher average cosine distance indicates greater semantic separation.
  • ...and 29 more figures
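
The diversity measures named in the captions of Figures 4 to 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes each response has already been mapped to an embedding vector (the embedding model is not specified here), the function names are hypothetical, and the symmetrisation in `symmetric_nn_distance` (averaging the two directed nearest-neighbour means) is one plausible reading of "symmetrised".

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def intra_diversity(vectors):
    """Mean cosine distance over all pairs of responses from one model."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def inter_diversity(set_a, set_b):
    """Mean cosine distance over all cross-model response pairs."""
    total = sum(cosine_distance(u, v) for u in set_a for v in set_b)
    return total / (len(set_a) * len(set_b))


def symmetric_nn_distance(set_a, set_b):
    """Average of the two directed mean nearest-neighbour distances."""
    a_to_b = sum(min(cosine_distance(u, v) for v in set_b) for u in set_a) / len(set_a)
    b_to_a = sum(min(cosine_distance(v, u) for u in set_a) for v in set_b) / len(set_b)
    return 0.5 * (a_to_b + b_to_a)


# Toy 2-D "embeddings" (made up for illustration only).
model_a = [[1.0, 0.0], [0.0, 1.0]]  # two orthogonal responses
model_b = [[1.0, 0.0], [1.0, 0.0]]  # two identical responses

print(intra_diversity(model_a))                 # 1.0
print(intra_diversity(model_b))                 # 0.0
print(inter_diversity(model_a, model_b))        # 0.5
print(symmetric_nn_distance(model_a, model_b))  # 0.25
```

In the toy data, the model with orthogonal responses scores the maximum intra-model diversity for non-negative vectors, while the model that repeats itself scores zero, which matches the captions' reading that higher values indicate greater diversity.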