Manipulating Embeddings of Stable Diffusion Prompts

Niklas Deckers, Julia Peters, Martin Potthast

TL;DR

The paper addresses the challenge of controlling image outputs in text-to-image generation by moving from prompt rewriting to direct manipulation of the prompt embedding $\mathcal{C}=\psi(\mathcal{P})$. It introduces gradient-based embedding optimization and three interaction tools that guide the embedding via a metric in image space, a near-embedding navigation, or seed-aware reconstruction, without updating model weights. Key contributions include the formal problem $\mathcal{C}^* = \underset{\mathcal{C}}{\mathrm{arg\,min}}\; m(\mathrm{LDM}(\mathcal{C}))$, the three practical methods, and a user study showing reduced tedium and user preference for results. The work enables fine-grained, seed-robust or seed-adaptive image generation and opens the possibility of sharing reusable embedding modifiers across prompts and domains.

Abstract

Prompt engineering is still the primary way for users of generative text-to-image models to manipulate generated images in a targeted way. Based on treating the model as a continuous function and by passing gradients between the image space and the prompt embedding space, we propose and analyze a new method to directly manipulate the embedding of a prompt instead of the prompt text. We then derive three practical interaction tools to support users with image generation: (1) Optimization of a metric defined in the image space that measures, for example, the image style. (2) Supporting a user in creative tasks by allowing them to navigate in the image space along a selection of directions of "near" prompt embeddings. (3) Changing the embedding of the prompt to include information that a user has seen in a particular seed but has difficulty describing in the prompt. Compared to prompt engineering, user-driven prompt embedding manipulation enables a more fine-grained, targeted control that integrates a user's intentions. Our user study shows that our methods are considered less tedious and that the resulting images are often preferred.
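The core idea of passing gradients from the image space back to the prompt embedding, while leaving the model itself untouched, can be illustrated with a deliberately tiny stand-in. The sketch below is not the authors' implementation: a frozen linear map `W` plays the role of the generative model, a squared distance to a target plays the role of the image-space metric $m$, and all function names are hypothetical. It only shows the pattern of minimizing $m(\mathrm{LDM}(\mathcal{C}))$ over the embedding $\mathcal{C}$ by gradient descent.

```python
# Toy illustration only: a frozen linear map stands in for the diffusion model.
# All names here are hypothetical and do not come from the paper's code.

def generate(W, c):
    """Frozen 'generator': maps an embedding c to an 'image' x = W c."""
    return [sum(w * cj for w, cj in zip(row, c)) for row in W]

def metric(x, target):
    """Image-space metric m(x): squared distance to a target 'image'."""
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def metric_grad_wrt_embedding(W, c, target):
    """Chain rule through the frozen generator: dm/dc = W^T (2 (W c - target))."""
    x = generate(W, c)
    residual = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
    return [sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(c))]

def optimize_embedding(W, c0, target, lr=0.05, steps=200):
    """Plain gradient descent on the embedding; W is never updated."""
    c = list(c0)
    for _ in range(steps):
        grad = metric_grad_wrt_embedding(W, c, target)
        c = [cj - lr * gj for cj, gj in zip(c, grad)]
    return c

# A 2-dimensional 'embedding' mapped to a 3-dimensional 'image'.
W = [[1.0, 0.5], [0.0, 2.0], [1.0, -1.0]]
target = generate(W, [1.0, -0.5])      # a target that is reachable by construction
c_start = [0.0, 0.0]
c_opt = optimize_embedding(W, c_start, target)

print("metric before:", metric(generate(W, c_start), target))
print("metric after: ", metric(generate(W, c_opt), target))
```

In a real setting the generator would be the full differentiable diffusion pipeline and the gradient would come from automatic differentiation rather than a hand-written formula; the toy keeps only the structure of the optimization.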

Paper Structure

This paper contains 15 sections, 9 equations, 11 figures, and 1 algorithm.

Figures (11)

  • Figure 1: Our three techniques for manipulating prompt embeddings enable a user to (1) optimize an image quality metric, (2) navigate the prompt embedding space towards nearby variants, and (3) reconstruct a preferred image by introducing seed invariance.
  • Figure 2: Comparison of two approaches to interpolating between two prompt embeddings. NLERP results in unevenly distributed interpolated points on the sphere. Changing its interpolation parameter results in larger adjustments to the points near the center. SLERP provides more consistent control.
  • Figure 3: Selected example of an interpolation between two prompts, which can be found in our published data.
  • Figure 4: The user interface for our iterative human feedback method. The current image is shown on the bottom left. The choices are shown at the top. The bottom right shows a t-SNE (van der Maaten and Hinton, 2008) dimensionality reduction of the current embedding in the center and the five options scattered around.
  • Figure 5: Selected images generated with the prompt "Single Color Ball" and different random seeds.
  • ...and 6 more figures
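
The contrast drawn in the caption of Figure 2 can be reproduced numerically. The following is a minimal sketch, assuming unit-length vectors in two dimensions as stand-ins for prompt embeddings; the helper names are hypothetical and the paper's own handling of embedding norms may differ. NLERP interpolates linearly and then renormalizes, while SLERP interpolates along the arc between the two points.

```python
import math

def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def nlerp(a, b, t):
    """Normalized linear interpolation: mix linearly, then rescale to unit length."""
    mixed = [(1.0 - t) * x + t * y for x, y in zip(a, b)]
    norm = math.sqrt(_dot(mixed, mixed))
    return [x / norm for x in mixed]

def slerp(a, b, t):
    """Spherical linear interpolation between unit vectors a and b."""
    omega = math.acos(max(-1.0, min(1.0, _dot(a, b))))  # angle between a and b
    if omega < 1e-8:                                    # nearly identical vectors
        return list(a)
    s = math.sin(omega)
    wa = math.sin((1.0 - t) * omega) / s
    wb = math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

def step_angles(interp, a, b, ts):
    """Angles in degrees between consecutive interpolated points."""
    pts = [interp(a, b, t) for t in ts]
    return [math.degrees(math.acos(max(-1.0, min(1.0, _dot(p, q)))))
            for p, q in zip(pts, pts[1:])]

# Two unit vectors 150 degrees apart, sampled at evenly spaced parameters.
a = [1.0, 0.0]
b = [math.cos(math.radians(150.0)), math.sin(math.radians(150.0))]
ts = [0.0, 0.25, 0.5, 0.75, 1.0]

nlerp_steps = step_angles(nlerp, a, b, ts)
slerp_steps = step_angles(slerp, a, b, ts)
print("NLERP step angles:", [round(s, 2) for s in nlerp_steps])
print("SLERP step angles:", [round(s, 2) for s in slerp_steps])
```

With evenly spaced parameters, the SLERP steps cover equal angles, whereas the NLERP steps are small near the endpoints and large near the middle, which is the uneven spacing the caption describes.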