ControlGUI: Guiding Generative GUI Exploration through Perceptual Visual Flow

Aryan Garg, Yue Jiang, Antti Oulasvirta

Abstract

During the early stages of interface design, designers need to produce multiple sketches to explore a design space. Design tools often fail to support this critical stage because they require designers to specify more detail than necessary. Although recent advances in generative AI have raised hopes of solving this issue, in practice they fail because expressing loose ideas in a prompt is impractical. In this paper, we propose a diffusion-based approach to the low-effort generation of interface sketches. It breaks new ground by allowing flexible control of the generation process via three types of inputs: A) prompts, B) wireframes, and C) visual flows. The designer can provide any combination of these as input at any level of detail and receives a diverse gallery of low-fidelity solutions in response. The unique benefit is that large design spaces can be explored rapidly with very little effort spent on input specification. We present qualitative results for various combinations of input specifications. Additionally, we demonstrate that our model aligns more accurately with these specifications than other models.

Paper Structure

This paper contains 36 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Our diffusion-based model generates diverse low-fidelity GUIs by conditioning on both local and global properties. The model integrates Stable Diffusion with specialized adapters: the wireframe adapter manages local properties such as element positioning and element types specified on wireframes, while the flow adapter directs the overall visual flow of attention. Given wireframes, prompts, and visual flow patterns as inputs, the model effectively produces varied GUI designs.
  • Figure 2: ControlGUI can generate diverse GUIs under different input conditions. a) With only a text prompt, the model generates diverse GUIs consistent with the given description. b,c,d) Adding a wireframe enforces structural fidelity, ensuring elements such as text, images, and buttons follow the specified layout. Results show that varying the wireframe or text prompt leads to distinct variations in GUI topic and structure. e) Incorporating visual flow further constrains generation by guiding user attention across elements in the intended order. Scanpaths are visualized using a color gradient (green → blue) indicating temporal progression, with circles marking fixation points, closely matching the specified sequence.
  • Figure 3: Comparison of GUI generation results across baseline models and ControlGUI. Given the same text prompt and wireframe, ControlNet produces abstract layouts without meaningful semantic alignment. Stitch enforces structural coherence but results in overly simplistic and template-like designs. GPT-5 generates textually coherent websites, but layouts are rigid and lack stylistic diversity. In contrast, our model ControlGUI produces realistic, visually rich GUIs that simultaneously adhere to the wireframe structure, reflect the semantics of the prompt, and provide stylistic diversity suitable for early design exploration.
  • Figure 4: Examples of GUIs generated by participants during the user study. Each row illustrates a different design brief (left: text prompt + wireframe input, right: generated outputs). The examples span a range of application scenarios. Despite being trained on wireframes with rigid bounding boxes, the model is also able to handle hand-drawn wireframes, producing diverse design variants aligned with user intent.
  • Figure 5: Participants’ subjective ratings of diversity, controllability, and creativity when comparing prompt-only input (baseline) against our multimodal system. Across these dimensions, participants rated our system significantly higher, showing the benefits of combining prompts with wireframes and visual flows to guide GUI generation.
  • ...and 1 more figures