SketchFlex: Facilitating Spatial-Semantic Coherence in Text-to-Image Generation with Region-Based Sketches

Haichuan Lin, Yilin Ye, Jiazhi Xia, Wei Zeng

TL;DR

SketchFlex tackles the challenge that non-experts face when steering text-to-image generation with precise spatial and semantic constraints. It combines sketch-based region inputs with automated prompt tuning via a semantic space and multimodal prompting, then refines rough sketches through a decompose-and-recompose workflow that isolates and aligns individual object shapes before anchoring them with shape-aware conditioning. The system demonstrates improved semantic coherence and spatial fidelity over end-to-end and region-based baselines, while reducing cognitive load for novices. These contributions enable more accessible, flexible, and user-intent–driven image generation with potential impact on design, art, and asset creation. The work also discusses limitations, ethical considerations, and directions for integrating more models and progressive sketching paradigms.

Abstract

Text-to-image models can generate visually appealing images from text descriptions. Prior work has sought to improve model control through prompt tuning and spatial conditioning. However, our formative study highlights the challenges non-expert users face in crafting appropriate prompts and specifying fine-grained spatial conditions (e.g., depth or canny references) to generate semantically cohesive images, especially when multiple objects are involved. In response, we introduce SketchFlex, an interactive system designed to improve the flexibility of spatially conditioned image generation using rough region sketches. The system automatically infers user prompts with rational descriptions within a semantic space enriched by crowd-sourced object attributes and relationships. Additionally, SketchFlex refines users' rough sketches into canny-based shape anchors, ensuring generation quality and alignment with user intentions. Experimental results demonstrate that SketchFlex generates more cohesive images than end-to-end models, while significantly reducing cognitive load and better matching user intentions than a region-based generation baseline.
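To make the idea of inferring a prompt from rough region sketches more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the `Region` type, the `ATTRIBUTE_HINTS` dictionary (a toy stand-in for the crowd-sourced semantic space), and the helpers `relative_position` and `build_prompt` are all hypothetical names introduced for illustration. The code assumes each sketched region is reduced to a labeled bounding box, and it only assembles the text part of a prompt that could be passed to a multimodal model alongside the sketch.

```python
from dataclasses import dataclass


@dataclass
class Region:
    label: str   # user-provided object name for the sketched region
    box: tuple   # (x_min, y_min, x_max, y_max) in pixels, y grows downward


# Hypothetical stand-in for a semantic space of object attributes.
ATTRIBUTE_HINTS = {
    "man": ["sitting", "standing", "holding"],
    "umbrella": ["open", "closed"],
}


def center(box):
    """Return the (x, y) center of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def relative_position(a, b):
    """Coarse position of region a relative to region b, from box centers."""
    (ax, ay), (bx, by) = center(a.box), center(b.box)
    dx, dy = ax - bx, ay - by
    if abs(dx) >= abs(dy):
        return "left of" if dx < 0 else "right of"
    return "above" if dy < 0 else "below"


def build_prompt(regions):
    """Assemble a text prompt listing regions, attribute hints, and layout."""
    lines = ["Describe a coherent scene containing these sketched regions:"]
    for r in regions:
        hints = ", ".join(ATTRIBUTE_HINTS.get(r.label, [])) or "none"
        lines.append(f"- {r.label} at {r.box}; candidate attributes: {hints}")
    # Add one coarse spatial relation per unordered pair of regions.
    for i, a in enumerate(regions):
        for b in regions[i + 1:]:
            lines.append(f"- {a.label} is {relative_position(a, b)} {b.label}")
    return "\n".join(lines)
```

Calling `build_prompt([Region("man", (10, 40, 60, 120)), Region("umbrella", (20, 0, 70, 35))])` yields a short bulleted description in which the man is placed below the umbrella. A real system would presumably pass the sketch image itself to the model rather than only box coordinates.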

Paper Structure

This paper contains 43 sections, 3 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Failure cases of existing methods for rough-sketch-based image generation: a) missing object for the green sketch, b) wrong perspective of the man sitting on the bench, c) unnatural relationship, as the man is not holding the umbrella, and d) unrealistic scenario with water in the carriage. (a)-(c) are generated by Dense Diffusion [kim2023dense]; (d) is generated by MultiDiffusion [bar2023multidiffusion].
  • Figure 2: SketchFlex consists of three main components: (1) sketch-aware prompt recommendation that supports users in crafting effective prompts for the rough sketch; (2) object shape refinement through single-object decomposition and generation; and (3) spatial adjustment and anchoring of object shapes.
  • Figure 3: Our sketch-aware prompt recommendation first builds a semantic space through data-driven analysis of key semantic elements covering single-object and cross-object properties. The semantic space is then integrated with attribute and relationship references retrieved from a semantic dataset. Finally, this semantic guidance is combined with the user's initial sketch to form a sketch-aware multimodal prompt for the MLLM, supporting spatial-aware inference.
  • Figure 4: Spatial-condition sketch refinement helps novice users refine their sketches by generating a more realistic and accurate sketch for each object through single-object decomposition and generation, and then letting users interactively refine the result through object selection and spatial adjustment.
  • Figure 5: Ablation study shows that prompt recommendation avoids common issues like missing objects and unrealistic relationships while sketch refinement further enhances fine-grained control.
  • ...and 7 more figures