
A Creative Agent is Worth a 64-Token Template

Ruixiao Shi, Fu Feng, Yucheng Xie, Xu Yang, Jing Wang, Xin Geng

Abstract

Text-to-image (T2I) models have substantially improved image fidelity and prompt adherence, yet their creativity remains constrained by reliance on discrete natural language prompts. When presented with fuzzy prompts such as ``a creative vinyl record-inspired skyscraper'', these models often fail to infer the underlying creative intent, leaving creative ideation and prompt design largely to human users. Recent reasoning- or agent-driven approaches iteratively augment prompts but incur high computational and monetary costs, as their instance-specific generation makes ``creativity'' costly and non-reusable, requiring repeated queries or reasoning for subsequent generations. To address this, we introduce \textbf{CAT}, a framework for \textbf{C}reative \textbf{A}gent \textbf{T}okenization that encapsulates agents' intrinsic understanding of ``creativity'' through a \textit{Creative Tokenizer}. Given the embeddings of fuzzy prompts, the tokenizer generates a reusable token template that can be directly concatenated with them to inject creative semantics into T2I models without repeated reasoning or prompt augmentation. To enable this, the tokenizer is trained via creative semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent's latent creative representations. Extensive experiments on \textbf{\textit{Architecture Design}}, \textbf{\textit{Furniture Design}}, and \textbf{\textit{Nature Mixture}} tasks demonstrate that CAT provides a scalable and effective paradigm for enhancing creativity in T2I generation, achieving a $3.7\times$ speedup and a $4.8\times$ reduction in computational cost, while producing images with superior human preference and text-image alignment compared to state-of-the-art T2I models and creative generation methods.
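The template mechanism described in the abstract can be illustrated with a minimal sketch. It assumes the fuzzy prompt embedding is a `(tokens, dim)` array and takes the template length of 64 from the paper title. `ToyCreativeTokenizer`, its random linear maps, the mean pooling, and the choice to append the template after the prompt tokens are all hypothetical stand-ins and not the authors' architecture; a real tokenizer would be trained via the semantic disentanglement objective.

```python
import numpy as np

TEMPLATE_TOKENS = 64  # template length, taken from the paper title


class ToyCreativeTokenizer:
    """Hypothetical stand-in for the Creative Tokenizer (untrained, illustrative only)."""

    def __init__(self, dim: int, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # Assumption: one random linear map per template slot, shape (64, dim, dim).
        self.weights = rng.standard_normal((TEMPLATE_TOKENS, dim, dim)) / np.sqrt(dim)

    def __call__(self, fuzzy_embedding: np.ndarray) -> np.ndarray:
        # Assumption: mean-pool the prompt tokens into a single vector of shape (dim,).
        pooled = fuzzy_embedding.mean(axis=0)
        # Project the pooled vector into 64 template vectors, shape (64, dim).
        return self.weights @ pooled


def build_creative_embedding(fuzzy_embedding: np.ndarray, template: np.ndarray) -> np.ndarray:
    # Concatenate along the token axis; appending the template after the
    # prompt tokens is an assumption, the source does not state the order.
    return np.concatenate([fuzzy_embedding, template], axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    fuzzy = rng.standard_normal((10, 16))  # 10 prompt tokens, toy dim of 16
    tokenizer = ToyCreativeTokenizer(dim=16)
    template = tokenizer(fuzzy)            # computed once, reusable across generations
    creative = build_creative_embedding(fuzzy, template)
    print(template.shape, creative.shape)
```

The point of the sketch is the cost structure: the template is computed once per fuzzy prompt and then reused for every subsequent generation, so no further agent queries or reasoning steps are needed.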

Paper Structure (29 sections, 8 equations, 12 figures, 3 tables)


Figures (12)

  • Figure 1: CAT: Creative Agent Tokenization for Efficient Creative Generation. CAT generates the token template that directly injects creative intent into fuzzy prompts, enabling efficient, high-quality combinatorial creativity across Architecture Design, Furniture Design, and Nature Mixture tasks.
  • Figure 2: Breaking the Quality-Efficiency Bottleneck in Creative Generation. Compared with (a) direct generation, (b) think-then-generation and (c) agent-based approaches improve conceptual fusion but incur high time and API costs (see App. \ref{app:cost}). In contrast, (d) CAT encapsulates the agent’s intrinsic understanding of "creativity" in a reusable token template that directly enhances semantic representations, delivering superior visual quality with minimal cost and inference time.
  • Figure 3: Overview of CAT. (a) We introduce a Creative Augmentor to augment fuzzy prompts and a Creative Evaluator to filter generated concepts, with valid ones stored in a Concept Pool. (b) The Creative Tokenizer maps each fuzzy prompt embedding to a corresponding token template, which is directly concatenated with the fuzzy embedding to form a creative embedding for creative generation. (c) The Creative Tokenizer is trained via semantic disentanglement, leveraging relations among partially overlapping concept pairs to capture the agent’s implicit understanding of creativity.
  • Figure 4: Performance of CAT on the Architecture Design and Furniture Design Tasks. We compare CAT with the representative open-source T2I model FLUX.1 \cite{black2024}, proprietary models GPT-Image-1.5 \cite{openai2025gptimage15} and Gemini 3.1 Flash Image \cite{google2026geminiflashimage} (enhanced with Gemini-3.1-Pro for complex reasoning), as well as agent-driven creative generation methods T2I-Copilot \cite{chen2025t2i} and CREA \cite{venkateshcrea}.
  • Figure 5: Performance of CAT on the Nature Mixture Task. We compare CAT with representative T2I models, including FLUX.1 \cite{black2024} and Stable Diffusion 3 (SD3) \cite{esser2024scaling}, as well as state-of-the-art creative generation methods such as BASS \cite{li2024tp2o}, AGSwap \cite{zhang2025agswap}, and CreTok \cite{feng2025redefining}.
  • ...and 7 more figures