Manipulating Embeddings of Stable Diffusion Prompts
Niklas Deckers, Julia Peters, Martin Potthast
TL;DR
The paper addresses the challenge of controlling image outputs in text-to-image generation by moving from prompt rewriting to direct manipulation of the prompt embedding $\mathcal{C}=\psi(\mathcal{P})$. It introduces gradient-based embedding optimization and three interaction tools that steer the embedding by optimizing a metric defined in image space, by navigating along directions of nearby prompt embeddings, or by incorporating information from a particular seed, all without updating model weights. Key contributions include the formal problem $\mathcal{C}^* = \underset{\mathcal{C}}{\mathrm{arg\,min}}\; m(\mathrm{LDM}(\mathcal{C}))$, where $m$ is a metric evaluated on the generated image, the three practical methods, and a user study in which participants considered the methods less tedious than prompt engineering and often preferred the resulting images. The work enables fine-grained, seed-robust or seed-adaptive image generation and opens the possibility of sharing reusable embedding modifiers across prompts and domains.
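One way to read the objective, assuming plain gradient descent (the optimizer and the step size $\eta$ are illustrative assumptions, not details given in this summary), is as an iteration that starts from the text encoder's output and moves only the embedding while the model weights stay fixed:

$$\mathcal{C}_0 = \psi(\mathcal{P}), \qquad \mathcal{C}_{t+1} = \mathcal{C}_t - \eta\, \nabla_{\mathcal{C}}\, m(\mathrm{LDM}(\mathcal{C}_t))$$

By the chain rule, the gradient passes from image space back to embedding space through the generator, with $\mathcal{I} = \mathrm{LDM}(\mathcal{C})$ denoting the generated image:

$$\nabla_{\mathcal{C}}\, m(\mathrm{LDM}(\mathcal{C})) = \left(\frac{\partial\, \mathrm{LDM}(\mathcal{C})}{\partial \mathcal{C}}\right)^{\!\top} \nabla_{\mathcal{I}}\, m(\mathcal{I})$$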
Abstract
Prompt engineering is still the primary way for users of generative text-to-image models to manipulate generated images in a targeted way. Based on treating the model as a continuous function and by passing gradients between the image space and the prompt embedding space, we propose and analyze a new method to directly manipulate the embedding of a prompt instead of the prompt text. We then derive three practical interaction tools to support users with image generation: (1) Optimization of a metric defined in the image space that measures, for example, the image style. (2) Supporting a user in creative tasks by allowing them to navigate in the image space along a selection of directions of "near" prompt embeddings. (3) Changing the embedding of the prompt to include information that a user has seen in a particular seed but has difficulty describing in the prompt. Compared to prompt engineering, user-driven prompt embedding manipulation enables a more fine-grained, targeted control that integrates a user's intentions. Our user study shows that our methods are considered less tedious and that the resulting images are often preferred.
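The first interaction tool, optimizing an image-space metric with respect to the prompt embedding, can be illustrated with a minimal sketch. Everything in it is a stand-in chosen for illustration: `toy_generator` is a tiny frozen function in place of the diffusion model, `brightness_metric` is a hypothetical metric, and the gradient is approximated by finite differences, whereas a real pipeline would backpropagate through the model. Only the embedding changes; the generator weights are never updated.

```python
import math

# Frozen "model weights" of the toy generator (hypothetical values).
WEIGHTS = [
    [0.5, -0.2, 0.1],
    [0.3, 0.4, -0.6],
    [-0.7, 0.2, 0.5],
    [0.6, 0.1, 0.3],
]


def toy_generator(embedding):
    """Stand-in for LDM(C): maps an embedding to 'pixel' values in (-1, 1)."""
    return [
        math.tanh(sum(w * c for w, c in zip(row, embedding)))
        for row in WEIGHTS
    ]


def brightness_metric(image, target=0.5):
    """Hypothetical metric m: squared gap between mean pixel value and a target."""
    mean = sum(image) / len(image)
    return (mean - target) ** 2


def loss(embedding):
    """The objective m(LDM(C)) as a function of the embedding alone."""
    return brightness_metric(toy_generator(embedding))


def numerical_gradient(fn, point, eps=1e-5):
    """Central finite differences, used here instead of backpropagation."""
    grad = []
    for i in range(len(point)):
        plus = list(point)
        minus = list(point)
        plus[i] += eps
        minus[i] -= eps
        grad.append((fn(plus) - fn(minus)) / (2 * eps))
    return grad


def optimize_embedding(embedding, steps=300, lr=0.3):
    """Gradient descent on the embedding; WEIGHTS stay untouched."""
    current = list(embedding)
    for _ in range(steps):
        grad = numerical_gradient(loss, current)
        current = [c - lr * g for c, g in zip(current, grad)]
    return current


initial = [0.0, 0.0, 0.0]          # plays the role of psi(P)
initial_loss = loss(initial)
optimized = optimize_embedding(initial)
final_loss = loss(optimized)
```

Comparing `final_loss` with `initial_loss` shows whether the optimized embedding yields an output closer to the target under the chosen metric. Swapping in a different metric changes what the embedding is steered toward without touching the generator.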
