PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement

Zhijie Wang, Yuheng Huang, Da Song, Lei Ma, Tianyi Zhang

TL;DR

PromptCharm addresses the difficulty of prompt engineering for novice users of text-to-image models. It is a mixed-initiative system that combines automated prompt refinement (via Promptist), multi-modal style exploration, attention-based explanations, image inpainting, and version-controlled iterations. Users refine prompts and images iteratively by adjusting the model's attention to individual keywords and by exploring a database of image styles, guided by DAAM heatmaps that visualize how each prompt token influences the image. Two user studies with 24 participants in total (a controlled study with 12 and an exploratory study with another 12) indicate that PromptCharm yields higher image similarity and greater user satisfaction than two baseline variants lacking its interactive and explanatory features. The work highlights the value of explanations and rich feedback loops in human-AI co-creation: participants' images aligned better with their intent and aesthetics, which has implications for wider adoption of prompt engineering tools in creative workflows.
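
The token-to-image heatmaps mentioned above can be illustrated with a small sketch. It assumes that per-token cross-attention maps have already been extracted from the model and resized to one common resolution (DAAM itself upscales maps of different sizes before combining them); the function name `aggregate_heatmap` and the nested-list representation are hypothetical choices for illustration, not PromptCharm's or DAAM's actual API.

```python
def aggregate_heatmap(attention_maps):
    """Sum same-sized 2D attention maps for one token and rescale to [0, 1].

    attention_maps: list of 2D lists (rows x cols), one per layer/head/step.
    Assumption: all maps were already resized to the same shape.
    """
    rows, cols = len(attention_maps[0]), len(attention_maps[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for amap in attention_maps:
        for r in range(rows):
            for c in range(cols):
                total[r][c] += amap[r][c]
    peak = max(max(row) for row in total)
    if peak == 0:
        return total  # token received no attention anywhere
    # Divide by the peak so the strongest location maps to 1.0.
    return [[value / peak for value in row] for row in total]


# Two hypothetical 2x2 maps for the same token.
heatmap = aggregate_heatmap([[[0, 1], [2, 1]], [[0, 1], [2, 3]]])
```

Normalizing by the peak value is only one possible display choice; it makes heatmaps for different tokens visually comparable but discards their absolute magnitudes.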

Abstract

Recent advances in Generative AI have significantly improved text-to-image generation. The state-of-the-art text-to-image model, Stable Diffusion, is now capable of synthesizing high-quality images with a strong sense of aesthetics. Crafting text prompts that align with the model's interpretation and the user's intent thus becomes crucial. However, prompting remains challenging for novice users due to the complexity of the Stable Diffusion model and the non-trivial effort required to iteratively edit and refine text prompts. To address these challenges, we propose PromptCharm, a mixed-initiative system that facilitates text-to-image creation through multi-modal prompt engineering and refinement. To assist novice users in prompting, PromptCharm first automatically refines and optimizes the user's initial prompt. Furthermore, PromptCharm supports the user in exploring and selecting different image styles within a large database. To assist users in effectively refining their prompts and images, PromptCharm renders model explanations by visualizing the model's attention values. If the user notices any unsatisfactory areas in the generated images, they can further refine the images through model attention adjustment or image inpainting within the rich feedback loop of PromptCharm. To evaluate the effectiveness and usability of PromptCharm, we conducted a controlled user study with 12 participants and an exploratory user study with another 12 participants. These two studies show that participants using PromptCharm created images of higher quality that were better aligned with their expectations than when using two variants of PromptCharm that lacked interaction or visualization support.
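
To make "model attention adjustment" more concrete, the following is a minimal sketch of one common way to re-weight a single prompt token: scale that token's cross-attention weight at every image location and renormalize. This is an assumption for illustration; the abstract does not specify how PromptCharm implements the adjustment, and the function name `reweight_token_attention`, the row-per-location layout, and the renormalization step are all hypothetical.

```python
def reweight_token_attention(attention, token_index, factor):
    """Scale one token's attention at every image location, then renormalize.

    attention: list of rows, one per image location; each row holds one
               non-negative weight per prompt token.
    token_index: position of the keyword to strengthen or weaken.
    factor: > 1 increases attention to the token, < 1 decreases it.
    """
    adjusted = []
    for row in attention:
        scaled = list(row)  # copy so the input is left untouched
        scaled[token_index] *= factor
        total = sum(scaled)
        # Renormalize so each location's weights sum to 1 again (assumed design).
        adjusted.append([w / total for w in scaled] if total > 0 else scaled)
    return adjusted


# Halve the attention paid to token 0 at two hypothetical image locations.
weights = reweight_token_attention([[0.5, 0.5], [0.25, 0.75]], token_index=0, factor=0.5)
```

With renormalization, lowering one token's weight implicitly raises the share of the other tokens at the same location; a variant that skips renormalization would change only the targeted token.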

Paper Structure (50 sections, 16 figures, 3 tables, 1 algorithm)

This paper contains 50 sections, 16 figures, 3 tables, 1 algorithm.

Figures (16)

  • Figure 1: User interface of PromptCharm. (a) The user can first type their initial input prompt in a text box. PromptCharm will then automatically refine the user's prompt. (b) PromptCharm further supports users in efficiently exploring different image styles. (c) The user can then examine the generated images with the help of model attention visualization. If they would like to refine the image, they can further (d) adjust the model's attention to a keyword or (e) directly inpaint the image.
  • Figure 2: PromptCharm provides (a) automated prompt refinement and (b) prompt editing in the prompting view. The user can (c) replace a modifier with similar/dissimilar styles, (d) adjust the model's attention to a keyword, or (e) explore popular modifiers and (f) append them to the prompt.
  • Figure 3: PromptCharm provides version control with XAI feedback to help users iteratively improve their creations. (a) The user can efficiently switch between different versions. (b) The user can also observe the model's attention to each token. (c) The user can further hover over a token to check the corresponding parts in the generated image. (d) Once the user notices any "over-attending", they can directly adjust the attention to specific keywords. For example, after the user adjusts the attention to the word "wolf", the model avoids mis-attending to the "human child" during generation.
  • Figure 4: The user can (a) mask the undesired areas in a generated image then (b) re-generate these areas in PromptCharm. (c) The user can further provide text prompts to guide the inpainting process.
  • Figure 5: An example scenario of iteratively improving an image creation featuring "a wolf sitting next to a human child in front of the full moon."
  • ...and 11 more figures
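
The mask-then-regenerate interaction described for Figure 4 can be sketched as a simple compositing step: pixels inside the user's mask come from the regenerated image, and all other pixels are kept from the original. This only illustrates the masking contract under simplifying assumptions; actual diffusion inpainting typically works in latent space and conditions on the unmasked content, and the function name `composite_inpaint` and the nested-list image format are hypothetical.

```python
def composite_inpaint(original, regenerated, mask):
    """Keep original pixels outside the mask and take regenerated pixels inside it.

    original, regenerated: 2D lists of pixel values with the same shape.
    mask: 2D list of booleans (or 0/1); True marks an area the user masked out.
    """
    if not (len(original) == len(regenerated) == len(mask)):
        raise ValueError("original, regenerated, and mask must have the same height")
    result = []
    for orig_row, new_row, mask_row in zip(original, regenerated, mask):
        if not (len(orig_row) == len(new_row) == len(mask_row)):
            raise ValueError("original, regenerated, and mask must have the same width")
        result.append([new if masked else old
                       for old, new, masked in zip(orig_row, new_row, mask_row)])
    return result


# Replace only the top-right pixel of a hypothetical 2x2 grayscale image.
patched = composite_inpaint([[10, 20], [30, 40]], [[1, 2], [3, 4]], [[0, 1], [0, 0]])
```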