Probing and Inducing Combinational Creativity in Vision-Language Models

Yongqian Peng, Yuxi Ma, Mengmeng Wang, Yuxuan Wang, Yizhou Wang, Chi Zhang, Yixin Zhu, Zilong Zheng

TL;DR

The paper investigates whether Vision-Language Models exhibit genuine combinational creativity or rely on pattern matching. It introduces the Identification-Explanation-Implication (IEI) framework grounded in conceptual blending and creates the CreativeMashup dataset with expert annotations to evaluate both understanding and generation. Across comprehension tasks, state-of-the-art models surpass average human performance but do not reach expert-level understanding, while incorporating IEI into the generation pipeline significantly enhances creative outputs. The work provides a theoretical foundation and practical guidelines for evaluating and improving artificial combinational creativity in VLMs, along with resources to advance future research.

Abstract

The ability to combine existing concepts into novel ideas is a fundamental hallmark of human intelligence. Recent advances in Vision-Language Models (VLMs) like GPT-4V and DALLE-3 have sparked debate about whether their outputs reflect combinational creativity--defined by M. A. Boden (1998) as synthesizing novel ideas through combining existing concepts--or sophisticated pattern matching of training data. Drawing inspiration from cognitive science, we investigate the combinational creativity of VLMs through the lens of concept blending. We propose the Identification-Explanation-Implication (IEI) framework, which decomposes creative processes into three levels: identifying input spaces, extracting shared attributes, and deriving novel semantic implications. To validate this framework, we curate CreativeMashup, a high-quality dataset of 666 artist-generated visual mashups annotated according to the IEI framework. Through extensive experiments, we demonstrate that in comprehension tasks, the best VLMs surpass average human performance while falling short of expert-level understanding; in generation tasks, incorporating our IEI framework into the generation pipeline significantly enhances the creative quality of VLMs' outputs. Our findings establish both a theoretical foundation for evaluating artificial creativity and practical guidelines for improving creative generation in VLMs.
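The three IEI levels named in the abstract can be pictured as a small record type. The sketch below is only an illustration: the class name, field names, and the `is_complete` helper are hypothetical and are not taken from the CreativeMashup release, and the example text is placeholder content rather than a real annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IEIAnnotation:
    """One mashup annotated at the three IEI levels (field names are hypothetical)."""

    identification: tuple[str, str]  # the two input spaces (source concepts)
    explanation: list[str]           # shared attributes that justify the blend
    implication: str                 # novel meaning derived from the blend

    def is_complete(self) -> bool:
        """Return True when all three levels carry non-empty content."""
        return (
            all(self.identification)
            and bool(self.explanation)
            and bool(self.implication.strip())
        )


# Placeholder content for illustration only; not taken from the dataset.
example = IEIAnnotation(
    identification=("fish", "garbage"),
    explanation=["<shared attribute goes here>"],
    implication="<implication goes here>",
)
```

Keeping the three levels as separate fields mirrors how the comprehension tasks are scored level by level, so a response can be checked at one level without the others.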

Paper Structure

This paper contains 37 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Combinational creativity across domains. Examples showing how combining two distinct elements creates novel concepts: sports car $+$ humanoid robot $\rightarrow$ Transformer (entertainment), utility cart $+$ hard case $\rightarrow$ wheeled luggage (industrial design), and human portrait $+$ lioness $\rightarrow$ Great Sphinx (ancient architecture). Each combination demonstrates how merging basic elements generates innovative outcomes.
  • Figure 2: Examples of the comprehension task and generation task. (a) The comprehension task demonstrates three evaluation components using a fish-garbage mashup image: human participants or VLMs identify primary objects, explain combination attributes, and interpret implications. (b) The generation task compares outputs from human experts and two model settings (Identification + Implication vs. Identification + Explanation + Implication) across three concept pairs (heart-trash, pistol-megaphone, paper money-mask).
  • Figure 3: Pairwise model comparison on implication task. The heatmap displays winning probabilities, where each cell $(i,j)$ shows the win rate of row model $i$ vs. column model $j$. Darker red indicates higher win rates, while darker blue represents lower win rates.
  • Figure 4: Two types of combination in comprehension tasks. (a) Replacement maintains functional or visual similarity while substituting for safer or more accessible alternatives. (b) Fusion merges two unrelated concepts to create a novel composite that inherits properties from both sources.
  • Figure 5: Precision in identification task by combination type: replacement vs. fusion. Precision metrics across combination categories show consistently higher precision for replacement-based combinations compared to fusion-based ones.
  • ...and 2 more figures
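The pairwise heatmap described for Figure 3 can be read as a win-rate matrix. The sketch below shows one way such a matrix could be computed; it is not the paper's evaluation code. The `(winner, loser)` judgment format, the function name, and the model names are assumptions made for illustration, the outcomes are invented, and ties are assumed not to occur.

```python
from collections import defaultdict


def win_rate_matrix(models, judgments):
    """Return matrix[i][j] = fraction of i-vs-j comparisons won by models[i].

    `judgments` is a hypothetical list of (winner, loser) name pairs.
    Cells with no comparisons, including the diagonal, are None.
    """
    wins = defaultdict(int)
    for winner, loser in judgments:
        wins[(winner, loser)] += 1

    matrix = []
    for a in models:
        row = []
        for b in models:
            total = wins[(a, b)] + wins[(b, a)]
            if a != b and total:
                row.append(wins[(a, b)] / total)
            else:
                row.append(None)
        matrix.append(row)
    return matrix


# Placeholder names and made-up outcomes, for illustration only.
models = ["model_a", "model_b"]
judgments = [
    ("model_a", "model_b"),
    ("model_a", "model_b"),
    ("model_b", "model_a"),
]
matrix = win_rate_matrix(models, judgments)
```

Under this definition, cells $(i,j)$ and $(j,i)$ sum to 1 whenever the two models were compared at least once, which is why such a heatmap is mirror-symmetric in color around the diagonal.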