Investigating Conceptual Blending of a Diffusion Model for Improving Nonword-to-Image Generation

Chihaya Matsuhira, Marc A. Kastner, Takahiro Komamizu, Takatsugu Hirayama, Ichiro Ide

TL;DR

This paper analyzes conceptual blending in a pretrained diffusion model, Stable Diffusion, and explores the best text embedding space conversion method for an existing nonword-to-image generation framework to ensure both the occurrence of conceptual blending and image generation quality.

Abstract

Text-to-image diffusion models sometimes depict blended concepts in the generated images. One promising use case of this effect is the nonword-to-image generation task, which attempts to generate images intuitively imaginable from a nonexistent word (nonword). To realize nonword-to-image generation, an existing study focused on associating nonwords with similar-sounding words. Since each nonword can have multiple similar-sounding words, generating images containing their blended concepts would increase intuitiveness, facilitating creative activities and promoting computational psycholinguistics. Nevertheless, no existing study has quantitatively evaluated this effect in either diffusion models or the nonword-to-image generation paradigm. Therefore, this paper first analyzes conceptual blending in a pretrained diffusion model, Stable Diffusion. The analysis reveals that a high percentage of generated images depict blended concepts when the input is an embedding interpolated between the text embeddings of two text prompts referring to different concepts. Next, this paper explores the best text embedding space conversion method for an existing nonword-to-image generation framework to ensure both the occurrence of conceptual blending and image generation quality. We compare the conventional direct prediction approach with the proposed method, which combines $k$-nearest neighbor search and linear regression. Evaluation reveals that the enhanced accuracy of the embedding space conversion by the proposed method improves the image generation quality, while the emergence of conceptual blending could be attributed mainly to specific dimensions of the high-dimensional text embedding space.
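To make the interpolation step in the abstract concrete, the following is a minimal sketch of blending two text embeddings. It assumes plain linear interpolation, which the abstract does not specify. The function name `interpolate_embeddings`, the random stand-in arrays, and the 77 x 768 shape are illustrative assumptions rather than details taken from the paper.

```python
import numpy as np


def interpolate_embeddings(emb_a, emb_b, ratio):
    """Linearly interpolate between two text embeddings of the same shape.

    ratio = 0.0 returns emb_a, ratio = 1.0 returns emb_b, and ratio = 0.5
    returns the midpoint between the two.
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)
    if emb_a.shape != emb_b.shape:
        raise ValueError("embeddings must have the same shape")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    return (1.0 - ratio) * emb_a + ratio * emb_b


# Random stand-ins for the embeddings of two prompts (assumed shape:
# 77 tokens x 768 dimensions; real embeddings would come from a text encoder).
rng = np.random.default_rng(0)
emb_concept_a = rng.normal(size=(77, 768))
emb_concept_b = rng.normal(size=(77, 768))

# Midpoint embedding, i.e. an interpolation ratio of 0.5.
midpoint = interpolate_embeddings(emb_concept_a, emb_concept_b, 0.5)
```

In an actual experiment, the interpolated array would replace the text embedding of a single prompt as the conditioning input of the diffusion model, and the ratio could be swept from 0 to 1 to observe where blended concepts appear.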

Paper Structure

This paper contains 39 sections, 3 equations, 11 figures, 7 tables.

Figures (11)

  • Figure 1: Example of conceptual blending of a text-to-image diffusion model [bib:latentdiffusion, bib:stablediffusion] when generating images from an interpolated text embedding (midpoint) between the embeddings of two text prompts referring to different concepts [bib:melzi].
  • Figure 2: Distribution of CLIP scores in matching and mismatching pairs and the classification boundary between the two classes.
  • Figure 3: Examples of image generation results showing two types of conceptual blending targeted by this paper. Red, blue, and purple squares indicate cases where our method detected Concept A, Concept B, and both concepts, respectively. The images are generated from an interpolated embedding between Concepts A and B with an interpolation ratio of around 0.5.
  • Figure 6: Generalized framework for nonword-to-image generation [bib:matsuhira1, bib:matsuhira2, bib:matsuhira3]. The embedding space conversion method is improved to preserve the neighborhood relationships.
  • Figure 7: Nonword-to-image generation results exhibiting conceptual blending generated using different methods.
  • ...and 6 more figures
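The embedding space conversion mentioned in the abstract and in the Figure 6 caption combines $k$-nearest neighbor search with linear regression. The sketch below shows one plausible way to combine the two: a local affine regression fitted only on the nearest neighbors of the query. This reading is an assumption, since the summary does not describe how the two components are connected. The function name `knn_linear_convert`, the Euclidean distance, the default `k`, and the synthetic data are all hypothetical.

```python
import numpy as np


def knn_linear_convert(query, src_train, tgt_train, k=5):
    """Map a source-space vector to the target space.

    1. Find the k training points nearest to `query` in the source space.
    2. Fit a least-squares affine map from those neighbors' source
       vectors to their target vectors.
    3. Apply the fitted map to `query`.
    """
    query = np.asarray(query, dtype=np.float64)
    src_train = np.asarray(src_train, dtype=np.float64)
    tgt_train = np.asarray(tgt_train, dtype=np.float64)

    # Step 1: Euclidean k-nearest-neighbor search (assumed distance).
    dists = np.linalg.norm(src_train - query, axis=1)
    idx = np.argsort(dists)[:k]

    # Step 2: affine least squares on the neighbors (bias via a ones column).
    x = np.hstack([src_train[idx], np.ones((len(idx), 1))])
    weights, *_ = np.linalg.lstsq(x, tgt_train[idx], rcond=None)

    # Step 3: predict the target-space vector for the query.
    return np.append(query, 1.0) @ weights


# Synthetic data with an exactly affine source-to-target relation.
rng = np.random.default_rng(0)
true_w = rng.normal(size=(3, 4))
true_b = rng.normal(size=4)
src_train = rng.normal(size=(200, 3))
tgt_train = src_train @ true_w + true_b

query = rng.normal(size=3)
predicted = knn_linear_convert(query, src_train, tgt_train, k=10)
```

Fitting the regression only on neighbors keeps the predicted embedding close to the target embeddings of nearby training items, which is one way a conversion could preserve neighborhood relationships. A direct prediction approach would instead fit a single global model on all training pairs.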