Enhancing Generative AI Image Refinement with Scribbles and Annotations: A Comparative Study of Multimodal Prompts

Hyerim Park, Phuong Thao Tran, Andre Luckow, Ceenu George, Michael Sedlmair, Malin Eiband

TL;DR

This study addresses the refinement gap in GenAI image tools by introducing pen-based scribbles and annotations as actionable prompts. Through a formative study and a within-subjects user study with 30 designers and design students, the authors compare text-only, visual-only, and combined prompting across spatial, semantic, and iterative refinement tasks. Key findings show visual prompts improve spatial clarity and speed, text prompts excel for semantic control, and the combined modality yields the highest overall satisfaction by enabling complementary strategies. The results inform the design of multimodal GenAI interfaces that better support iterative design workflows, reduce switching friction, and align with professional design practices, suggesting a path toward more integrated, task-aware refinement tools.

Abstract

Generative AI (GenAI) image tools are increasingly used in design practice, enabling rapid ideation but offering limited support for refinement tasks such as adjusting layout, scale, or visual attributes. While text prompts and inpainting allow localized edits, they often remain inefficient or ambiguous for precise, in-context, and iterative refinement, which motivates the exploration of alternative methods. This work examines how pen-based scribbles and annotations can enhance GenAI image refinement. A formative study with seven professional designers informed a prototype supporting three input modalities: text-only, visual-only, and combined prompting. A within-subjects study with 30 designers and design students compared these modalities across closed- and open-ended tasks, evaluating expressiveness, efficiency, workload, user experience, iteration, and multimodal strategies. Visual prompts improved clarity and speed for spatial edits while reducing workload, whereas text remained effective for semantic and global changes. The combined modality received the highest overall ratings, enabling complementary use, balancing spatial precision with semantic detail, and supporting smoother iteration. Task-specific preferences also emerged: adding new objects often required both modalities, while moving or modifying elements was typically handled through visual input. This work contributes (1) an empirical comparison of multimodal prompting for GenAI refinement, (2) a prototype integrating scribbles and annotations, and (3) insights into designers' multimodal strategies to inform future GenAI interfaces that better support refinement in design workflows.

Paper Structure (70 sections, 13 figures, 9 tables)

Figures (13)

  • Figure 1: Overview of the three input modalities in the prototype: (A) Text-only, using typed prompts with inpainting-based region selection; (B) Visual-only, using pen-based scribbles and annotations; and (C) Combination, in which text and visual inputs are both available and can be used as needed.
  • Figure 2: User interface of the multimodal prompting prototype in the Combination modality. (1) Central canvas supporting scribbles, annotations, and inpainting; (2) Toolbar with drawing and editing tools (pen, inpainting, eraser, canvas dragging, stroke width, color selection, undo/redo, and clear canvas); (3) Prompt panel for text input and submission; (4) History panel displaying previous generations for iterative refinement; (5) Visual prompt lexicon panel showing visual prompt examples; and (6) Task panel displaying source and target images, visible only during closed-ended tasks to illustrate refinement goals.
  • Figure 3: System pipeline of the prototype for image refinement. The process includes: (1) collecting text, visual prompts, and inpainting masks (Input Collection); (2) interpreting and merging them into a structured intent using GPT-4 or GPT-4o (Intent Interpretation); and (3) generating the refined image with either GPT-Image-1 for object additions or FLUX.1 Kontext Pro for contextual refinements (Image Generation).
  • Figure 4: Overview of the user study procedure. Each session (approximately 90 minutes) included an introduction, tutorial, three closed-ended tasks with alternating input modalities (Text-only, Visual-only, and Combination), and an open-ended task conducted exclusively in the Combination modality. Surveys followed each task, and the session concluded with an interview. The closed-ended phase was counterbalanced using a Latin square design with six sequences (example illustrated: Order 1).
  • Figure 5: Closed-ended tasks: (upper) three image sets were provided, each used with a different input modality depending on the assigned Latin-square group. Each set included (A) a source image, (B) the same image annotated with seven refinement tasks, (C) a corresponding target image, and (D) an overview of the seven task types. (lower) Example labeled source and target images from Sets 2 and 3. Image sets were generated in advance with DALL-E 3 and FLUX.1 to ensure all participants worked with identical materials.
  • ...and 8 more figures
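
Two details from the captions above can be sketched in a few lines of Python: the six-sequence counterbalancing of the three modalities (Figure 4) and the choice of image generator by edit type (Figure 3). This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes the six sequences are the six possible orderings of the three modalities, that participants are assigned round-robin, and that the interpreted intent carries an edit-type label; the names `sequence_for`, `Intent`, `edit_type`, and `choose_generator` are hypothetical, and the numbering of sequences here need not match "Order 1" in the paper.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import List, Tuple

MODALITIES = ("Text-only", "Visual-only", "Combination")

# All 3! = 6 orderings of the three modalities. Assumption: the six
# sequences mentioned in Figure 4 form a full counterbalance like this
# one; the order of entries in this list is arbitrary.
SEQUENCES: List[Tuple[str, ...]] = list(permutations(MODALITIES))


def sequence_for(participant_index: int) -> Tuple[str, ...]:
    """Assign sequences round-robin (assumed) so each is used equally often."""
    return SEQUENCES[participant_index % len(SEQUENCES)]


@dataclass
class Intent:
    """Hypothetical structured intent produced by the interpretation step."""
    edit_type: str    # assumed labels such as "add", "move", "modify"
    instruction: str  # merged description of the requested change


def choose_generator(intent: Intent) -> str:
    """Pick a generator name following the split described in Figure 3:
    object additions go to GPT-Image-1, other (contextual) refinements
    go to FLUX.1 Kontext Pro. No model is actually called here."""
    if intent.edit_type == "add":
        return "GPT-Image-1"
    return "FLUX.1 Kontext Pro"


if __name__ == "__main__":
    for i in range(6):
        print(i, sequence_for(i))
    print(choose_generator(Intent("add", "add a lamp on the desk")))
    print(choose_generator(Intent("move", "move the chair to the left")))
```

Under these assumptions, 30 participants would cover each of the six sequences five times, and every modality would appear equally often in the first, second, and third position.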