TP2O: Creative Text Pair-to-Object Generation using Balance Swap-Sampling

Jun Li, Zedong Zhang, Jian Yang

TL;DR

TP2O introduces Balance Swap-Sampling (BASS), a training-free method to generate creative objects from two object texts by swapping prompt embeddings, enforcing a CLIP-distance balance region, and selecting via SAM-based segmentation. The method combines a swapping mechanism with a geometric balance region defined by $|d(I_f,I_1)-d(I_f,I_2)|\le\alpha$ and $d(I_f,I_1)+d(I_f,I_2)\le 2\beta$, followed by coarse-to-fine sampling and semantic scoring to pick an optimal composite image. Experimental results on 5075 ImageNet-derived prompt pairs show BASS outperforms state-of-the-art T2I models and even rivals human artworks in novelty and appeal, with user studies indicating strong preference for BASS outputs. The approach demonstrates a practical, training-free path to out-of-distribution creative synthesis and suggests extensions to multi-concept prompts and learned swapping strategies.
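As a rough illustration of the balance region described above, the sketch below checks the two inequalities for one candidate image. It assumes that $d$ is a cosine distance between image embeddings; the summary only says "CLIP distance", so the exact form is an assumption. The function names `cosine_distance` and `in_balance_region`, the toy 2-D vectors, and the threshold values are hypothetical and are not taken from the paper.

```python
import numpy as np

def cosine_distance(u, v):
    """Cosine distance 1 - cos(u, v); an assumed stand-in for the CLIP distance d."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def in_balance_region(e_f, e_1, e_2, alpha, beta):
    """Return True if the candidate embedding e_f satisfies both constraints:
    |d(I_f, I_1) - d(I_f, I_2)| <= alpha   (roughly equidistant from both anchors)
    d(I_f, I_1) + d(I_f, I_2) <= 2 * beta  (not too far from either anchor)
    """
    d1 = cosine_distance(e_f, e_1)
    d2 = cosine_distance(e_f, e_2)
    return abs(d1 - d2) <= alpha and (d1 + d2) <= 2.0 * beta

# Toy embeddings (hypothetical): two orthogonal anchors and two candidates.
anchor_1, anchor_2 = [1.0, 0.0], [0.0, 1.0]
balanced = in_balance_region([1.0, 1.0], anchor_1, anchor_2, alpha=0.1, beta=0.4)
lopsided = in_balance_region([1.0, 0.1], anchor_1, anchor_2, alpha=0.1, beta=0.4)
print(balanced, lopsided)
```

In this toy setup the first candidate lies midway between the anchors and passes, while the second is almost identical to one anchor and fails the $\alpha$ constraint. With real CLIP image features the thresholds would need to be tuned.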

Abstract

Generating creative combinatorial objects from two seemingly unrelated object texts is a challenging task in text-to-image synthesis, often hindered by a focus on emulating existing data distributions. In this paper, we develop a straightforward yet highly effective method, called balance swap-sampling. First, we propose a swapping mechanism that generates a novel combinatorial object image set by randomly exchanging intrinsic elements of two text embeddings through a cutting-edge diffusion model. Second, we introduce a balance swapping region to efficiently sample a small subset from the newly generated image set by balancing CLIP distances between the new images and their original generations, increasing the likelihood of accepting high-quality combinations. Last, we employ a segmentation method to compare CLIP distances among the segmented components, ultimately selecting the most promising object from the sampled subset. Extensive experiments demonstrate that our approach outperforms recent SOTA T2I methods. Surprisingly, our results even rival those of human artists, for example on the frog-broccoli combination.
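The swapping step in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each swapping vector is a binary mask over the embedding dimension and that the same mask is applied to every token, which the abstract does not specify. The names `random_swap_vectors`, `swap_embeddings`, and `swap_prob`, as well as the toy shapes, are hypothetical.

```python
import numpy as np

def random_swap_vectors(num_vectors, dim, swap_prob=0.5, seed=0):
    """Draw a set of random binary swapping vectors, one per row.
    True at position j means 'take dimension j from the second embedding'."""
    rng = np.random.default_rng(seed)
    return rng.random((num_vectors, dim)) < swap_prob

def swap_embeddings(emb_1, emb_2, f):
    """Build a new embedding by exchanging the dimensions selected by f.
    emb_1, emb_2: arrays of shape (tokens, dim); f: boolean array of shape (dim,)."""
    return np.where(f[None, :], emb_2, emb_1)

# Toy text embeddings (hypothetical shapes): 4 tokens, 8 dimensions.
emb_1 = np.zeros((4, 8))
emb_2 = np.ones((4, 8))
F = random_swap_vectors(num_vectors=3, dim=8)
candidates = [swap_embeddings(emb_1, emb_2, f) for f in F]
print(len(candidates), candidates[0].shape)
```

In a full pipeline, each swapped embedding would be passed to the diffusion model to produce one candidate image, and the candidates would then be filtered by the balance region. Those steps are omitted here.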

Paper Structure (21 sections, 5 equations, 26 figures, 6 tables, 1 algorithm)

Figures (26)

  • Figure 1: We propose a simple yet effective, training-free sampling method to generate creative combinations from two object texts. Bottom row: original images from Stable-Diffusion2 (Rombach et al., 2022). Middle row: combinations produced by our algorithm. Top row: artworks by Les Créatonautes (https://www.lescreatonautes.fr/), a French creative agency, from its Instagram account (https://www.instagram.com/les.creatonautes/).
  • Figure 2: The pipeline of our balance swap-sampling method. Starting from the text embeddings obtained by feeding the two given texts into the text encoder, we introduce a swapping operation that collects a set $\mathcal{F}$ of random swapping vectors to form novel embeddings, generate a new image set $\mathcal{I}$ from them, and propose a balance region on which we build a sampling method for selecting an optimal combinatorial object image.
  • Figure 3: Visual comparison of different swapping strategies.
  • Figure 4: Geometric visualization of the region (shown in orange) where high-quality composite images are likely to lie, obtained by balancing the distances between $I_f$ and the anchor images $I_1$, $I_2$.
  • Figure 5: Visual comparisons of combinatorial object generations. We compare our BASS with Stable-Diffusion2 (Rombach et al., 2022), DALLE2 (Ramesh et al., 2022), ERNIE-ViLG2 (Baidu; Feng et al., 2022) and Bing (Microsoft) using a hybrid prompt. For a fairer comparison, we incorporate detailed textual descriptions alongside our generated images as input prompts. Even so, these models do not produce results comparable to ours, highlighting the superior creative combinatorial capabilities of our BASS. Furthermore, our results differ notably from the images retrieved from the LAION-5B dataset (Schuhmann et al., 2022), shown in the first row, highlighting our model's capacity to produce out-of-distribution images.
  • ...and 21 more figures