
CANVAS: A Benchmark for Vision-Language Models on Tool-Based User Interface Design

Daeheon Jeong, Seoyeon Byun, Kihoon Son, Dae Hyun Kim, Juho Kim

TL;DR

CANVAS introduces the first benchmark for evaluating vision-language models (VLMs) that carry out UI design by invoking tools inside design software, filling the gap between code or image generation and interactive design workflows. It comprises 598 tool-based tasks drawn from 3,327 mobile UIs across 30 categories, split into replication and modification settings, and scores outputs with a hierarchical suite of perceptual metrics (SSIM, saliency, BLIP) plus component-wise similarity. The study finds that higher-performing models invoke a more diverse and strategic set of tools in replication and select tools more precisely in modification, and that human preferences align with the proposed metrics. Error analysis identifies failures in geometric, layout, and text operations, pointing to where tool-based design automation most needs to improve.

Abstract

User interface (UI) design is an iterative process in which designers progressively refine their work with design software such as Figma or Sketch. Recent advances in vision language models (VLMs) with tool invocation suggest these models can operate design software to edit a UI design through iteration. Understanding and enhancing this capacity is important, as it highlights VLMs' potential to collaborate with designers within conventional software. However, as no existing benchmark evaluates tool-based design performance, the capacity remains unknown. To address this, we introduce CANVAS, a benchmark for VLMs on tool-based user interface design. Our benchmark contains 598 tool-based design tasks paired with ground-truth references sampled from 3.3K mobile UI designs across 30 function-based categories (e.g., onboarding, messaging). In each task, a VLM updates the design step-by-step through context-based tool invocations (e.g., create a rectangle as a button background), linked to design software. Specifically, CANVAS incorporates two task types: (i) design replication evaluates the ability to reproduce a whole UI screen; (ii) design modification evaluates the ability to modify a specific part of an existing screen. Results suggest that leading models exhibit more strategic tool invocations, improving design quality. Furthermore, we identify common error patterns models exhibit, guiding future work in enhancing tool-based design capabilities.
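
The step-by-step tool invocation described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the benchmark's actual interface: only the tool name `create_rectangle` appears in the source (Figure 4), while `Canvas`, `create_text`, `set_fill`, `TOOLS`, `run_invocations`, and every argument name are hypothetical stand-ins for whatever the design software exposes.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Canvas:
    """Toy stand-in for a design document: a flat list of node dicts."""
    nodes: List[Dict[str, Any]] = field(default_factory=list)


def create_rectangle(canvas, x, y, width, height, fill="#FFFFFF"):
    """Append a rectangle node and return its id (hypothetical signature)."""
    node = {"id": len(canvas.nodes), "type": "RECTANGLE",
            "x": x, "y": y, "width": width, "height": height, "fill": fill}
    canvas.nodes.append(node)
    return node["id"]


def create_text(canvas, x, y, content):
    """Append a text node and return its id (hypothetical tool)."""
    node = {"id": len(canvas.nodes), "type": "TEXT",
            "x": x, "y": y, "content": content}
    canvas.nodes.append(node)
    return node["id"]


def set_fill(canvas, node_id, fill):
    """Change the fill of an existing node in place (hypothetical tool)."""
    canvas.nodes[node_id]["fill"] = fill


# Registry mapping tool names to implementations (assumed structure).
TOOLS = {
    "create_rectangle": create_rectangle,
    "create_text": create_text,
    "set_fill": set_fill,
}


def run_invocations(canvas, invocations):
    """Apply tool calls one at a time, as a model would emit them."""
    results = []
    for call in invocations:
        tool = TOOLS[call["tool"]]
        results.append(tool(canvas, **call["args"]))
    return results


# A replication-style sequence: button background, then its label,
# followed by a modification-style edit to the background color.
canvas = Canvas()
invocations = [
    {"tool": "create_rectangle",
     "args": {"x": 24, "y": 600, "width": 327, "height": 48, "fill": "#1A73E8"}},
    {"tool": "create_text",
     "args": {"x": 150, "y": 614, "content": "Continue"}},
    {"tool": "set_fill", "args": {"node_id": 0, "fill": "#0B57D0"}},
]
results = run_invocations(canvas, invocations)
```

In this toy setup, the first two calls resemble a replication task (building a component from scratch), and the final `set_fill` call resembles a modification task (a targeted edit to an existing node). A real evaluation would render the resulting design and compare it against the ground-truth reference.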

Paper Structure

This paper contains 89 sections, 12 equations, 36 figures, 18 tables.

Figures (36)

  • Figure 1: Overview. CANVAS evaluates a VLM's capability to (A) generate a UI design with tool invocations in two tasks: (B) design replication and (C) design modification.
  • Figure 2: Design Modification Tasks. The design modification task measures a model’s capacity to perform targeted edits, requiring it to (A) adjust a component’s attributes, (B) insert a new component, or (C) switch the overall color scheme.
  • Figure 3: CANVAS Data Statistics: (A) Frequency of the five most frequent UI design types, with other categories grouped (see Appendix for full distribution). (B) The distribution of node tree depth is similar across the replication and modification sets with a Gaussian-like pattern. (C) The node count distribution is also similar across both sets. (D) The skewed frequency of node types per design indicates common patterns in component usage.
  • Figure 4: Tool Invocation Frequency: Average number of tool invocations per task ($n$, x-axis). Colored blocks show tool types (e.g., creation includes create_rectangle). Higher-performing models exhibit greater tool diversity.
  • Figure 5: Error case 1. The models (A) miscount the markers, (B) draw irregular lines, and (C) create an inconsistent layout.
  • ...and 31 more figures