Redefining <Creative> in Dictionary: Towards an Enhanced Semantic Understanding of Creative Generation

Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng

TL;DR

This work tackles the challenge of abstract creativity in diffusion-based image synthesis by redefining 'creative' as a universal token <CreTok>, learned through a TP2O-oriented CangJie dataset. The approach enables zero-shot combinatorial generation without task-specific retraining, significantly improving text–image alignment and human-perceived creativity. By optimizing the cosine similarity between restrictive and adaptive prompts and continually refining <CreTok> over diverse text pairs, the method achieves cohesive fusion of concepts (e.g., Lettuce and Mantis) and extends to multi-concept CT2I tasks, while remaining efficient (≈4 seconds per image). Extensive evaluations, including GPT-4o and a user study, show CreTok outperforms SOTA diffusion models and existing creative-generation methods in integration, originality, and aesthetics, with broad universality across styles and prompts.

Abstract

"Creative" remains an inherently abstract concept for both humans and diffusion models. While text-to-image (T2I) diffusion models can easily generate out-of-distribution concepts like "a blue banana", they struggle to generate combinatorial objects such as "a creative mixture that resembles a lettuce and a mantis", because they fail to grasp the semantic depth of "creative". Current methods rely heavily on synthesizing reference prompts or images to achieve a creative effect and typically require retraining for each unique creative output, a process that is computationally intensive and limits practical applications. To address this, we introduce CreTok, which brings meta-creativity to diffusion models by redefining "creative" as a new token, <CreTok>, thus enhancing models' semantic understanding for combinatorial creativity. CreTok achieves this redefinition by iteratively sampling diverse text pairs from our proposed CangJie dataset to form adaptive prompts and restrictive prompts, and then optimizing the similarity between their respective text embeddings. Extensive experiments demonstrate that <CreTok> enables the universal and direct generation of combinatorial creativity across diverse concepts without additional training, achieving state-of-the-art performance with improved text-image alignment and higher human preference ratings. Code will be made available at https://github.com/fu-feng/CreTok.
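The optimization loop described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the authors' implementation. It assumes a toy mean-pooling text encoder in place of the diffusion model's real text encoder, hypothetical prompt templates, and hypothetical concept pairs (only lettuce and mantis come from the paper). The only trainable parameter is the embedding of <CreTok>, updated by gradient ascent on the cosine similarity between the adaptive and restrictive prompt embeddings.

```python
import math
import random

DIM = 8              # toy embedding width (assumption)
CRETOK = "<CreTok>"  # the single trainable token

# Hypothetical concept pairs; only ("lettuce", "mantis") appears in the paper.
PAIRS = [("lettuce", "mantis"), ("banana", "owl"),
         ("teapot", "snail"), ("cactus", "whale")]

def norm(u):
    return math.sqrt(sum(x * x for x in u))

def cosine(u, v):
    return sum(x * y for x, y in zip(u, v)) / (norm(u) * norm(v))

def encode(tokens, table):
    """Toy text encoder: mean-pool token vectors (stand-in for a real encoder)."""
    vecs = [table[t] for t in tokens]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def build_prompts(a, b):
    """Hypothetical templates; the paper's actual templates may differ."""
    restrictive = ["a", a, b, "mixture"]
    adaptive = ["a", CRETOK, "mixture", "of", a, "and", b]
    return restrictive, adaptive

def mean_similarity(pairs, table):
    sims = []
    for a, b in pairs:
        restrictive, adaptive = build_prompts(a, b)
        sims.append(cosine(encode(adaptive, table), encode(restrictive, table)))
    return sum(sims) / len(sims)

def train_cretok(pairs, steps=500, lr=2.0, seed=0):
    rng = random.Random(seed)
    words = {"a", "mixture", "of", "and", CRETOK} | {w for p in pairs for w in p}
    # Frozen random vectors for ordinary words; only table[CRETOK] is updated.
    table = {w: [rng.gauss(0, 1) for _ in range(DIM)] for w in sorted(words)}
    before = mean_similarity(pairs, table)
    for _ in range(steps):
        a, b = rng.choice(pairs)                 # sample a text pair
        restrictive, adaptive = build_prompts(a, b)
        t = encode(restrictive, table)           # target embedding
        p = encode(adaptive, table)              # embedding containing <CreTok>
        cos, np_, nt = cosine(p, t), norm(p), norm(t)
        # Gradient of cos(p, t) with respect to p.
        grad_p = [(ti / nt - cos * pi / np_) / np_ for pi, ti in zip(p, t)]
        # Mean pooling makes dp/d<CreTok> equal to 1/len(adaptive).
        n = len(adaptive)
        table[CRETOK] = [c + lr * g / n for c, g in zip(table[CRETOK], grad_p)]
    return before, mean_similarity(pairs, table)

if __name__ == "__main__":
    before, after = train_cretok(PAIRS)
    print(f"mean cosine similarity: before={before:.3f}, after={after:.3f}")
```

Because the token is refined across many sampled pairs rather than fit to a single one, the learned vector is shared by all concept combinations, which is the property the abstract refers to as generation "without additional training". In a real setup the toy encoder and hand-written gradient would be replaced by the diffusion model's text encoder and automatic differentiation.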

Paper Structure

This paper contains 35 sections, 4 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: (a) In out-of-distribution generation, diffusion models can directly generate "a blue banana" without additional training, benefiting from the clear and concrete semantics of "blue". (b) However, they lack an intrinsic understanding of the abstract and ambiguous semantics of "creative". (c) Leveraging the TP2O (i.e., Creative Text Pair to Object) task, we redefine the token associated with "creative" as <CreTok> to bring models meta-creativity, allowing them to directly generate combinatorial creativity by enhancing their semantic understanding of "creative".
  • Figure 2: In each training iteration, a text pair and a prompt template are sampled to create a restrictive prompt and an adaptive prompt. The trainable <CreTok> token is then optimized to maximize the cosine similarity between the text embeddings of the adaptive and restrictive prompts. The refined adaptive prompt is then input into a diffusion model (e.g., Stable Diffusion 3 [esser2024scaling]) for creative image generation.
  • Figure 3: <CreTok> enhances diffusion models' semantic understanding of combinatorial creativity. We compare CreTok with SOTA T2I diffusion models, including Stable Diffusion 3 [esser2024scaling], Kandinsky 3 [razzhigaev2023kandinsky], Stable Diffusion 3.5 [stability2024], DALL-E 3 [ramesh2022hierarchical] and Midjourney v6.1 [midjourney], using identical prompts. CreTok, built on Stable Diffusion 3, replaces "creative" in prompts with the redefined <CreTok>.
  • Figure 4: Visual comparisons of combinatorial creativity. We compare CreTok with BASS [litp2o] and with other methods achieving similar combinatorial effects, including MagicMix [liew2022magicmix] and Black-Scholes [kothandaraman2024prompt], to highlight CreTok's superior performance. For fair comparison, most images from these methods are sourced directly from the original papers, with a white watermark added in the bottom right corner. Additionally, generation time per image is recorded to emphasize CreTok's meta-creativity and zero-shot capability.
  • Figure 5: Combinatorial creativity with no concepts or with more than two concepts. Images with white watermarks are sourced directly from the original paper of the comparison method.
  • ...and 10 more figures