UIClip: A Data-driven Model for Assessing User Interface Design

Jason Wu, Yi-Hao Peng, Amanda Li, Amanda Swearngin, Jeffrey P. Bigham, Jeffrey Nichols

TL;DR

UIClip addresses the challenge of objective, scalable UI design quality assessment by linking screenshots with natural-language descriptions through a CLIP-based framework. It introduces JitterWeb, a synthetic dataset of 2.3 million jittered UIs, and BetterApp, a human-rated dataset of 1.2K examples, to train and validate design quality scoring and to surface design suggestions. Across its benchmarks, UIClip outperforms large vision-language baselines in design quality, design suggestion accuracy, and design relevance, while maintaining a compact model size. The work demonstrates practical applications in UI code generation, design guidance, and example retrieval, and commits to releasing training data and code to foster further research in data-driven UI design evaluation.

Abstract

User interface (UI) design is a difficult yet important task for ensuring the usability, accessibility, and aesthetic qualities of applications. In our paper, we develop a machine-learned model, UIClip, for assessing the design quality and visual relevance of a UI given its screenshot and natural language description. To train UIClip, we used a combination of automated crawling, synthetic augmentation, and human ratings to construct a large-scale dataset of UIs, collated by description and ranked by design quality. Through training on the dataset, UIClip implicitly learns properties of good and bad designs by i) assigning a numerical score that represents a UI design's relevance and quality and ii) providing design suggestions. In an evaluation that compared the outputs of UIClip and other baselines to UIs rated by 12 human designers, we found that UIClip achieved the highest agreement with ground-truth rankings. Finally, we present three example applications that demonstrate how UIClip can facilitate downstream applications that rely on instantaneous assessment of UI design quality: i) UI code generation, ii) UI design tips generation, and iii) quality-aware UI example search.
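
As a rough illustration of how a CLIP-style model can turn a screenshot and a description into a single score, the sketch below compares two candidate UIs by the cosine similarity between a description embedding and each screenshot embedding. It assumes embeddings have already been produced by some image and text encoder; the vectors, the function names `cosine_similarity` and `prefer_design`, and the tie-breaking rule are hypothetical and are not taken from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def prefer_design(
    description_emb: Sequence[float],
    screenshot_a_emb: Sequence[float],
    screenshot_b_emb: Sequence[float],
) -> str:
    """Return "A" or "B" for the screenshot that scores higher.

    Assumption: a higher similarity to the description stands in for a
    better combined relevance/quality score. Ties go to "A".
    """
    score_a = cosine_similarity(description_emb, screenshot_a_emb)
    score_b = cosine_similarity(description_emb, screenshot_b_emb)
    return "A" if score_a >= score_b else "B"


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    description = [1.0, 0.0]
    ui_a = [0.9, 0.1]
    ui_b = [0.1, 0.9]
    print(prefer_design(description, ui_a, ui_b))
```

In practice the embeddings would come from the trained encoders, and the same score could be used to rank many candidate UIs rather than a single pair.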

Paper Structure

This paper contains 39 sections, 2 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Our synthetic dataset consisted of UIs that were processed by jitter functions to introduce design defects. In this figure, we visualize the effect of each jitter function independently, although up to three jitter functions can be applied simultaneously. Our crawler captures screenshots for multiple devices (desktop, tablet, and mobile), but due to space constraints, we only show rendered mobile examples.
  • Figure 2: Our process for generating text descriptions for JitterWeb. Based on a set of randomly-chosen jitter functions, several design defects are introduced, e.g., color swap, color noise, font swap. These design defects are recorded as a part of the jittered UI's caption, which helps our model associate design defects with the UI screenshot.
  • Figure 3: A screenshot of the application used for collecting human design ratings. Participants first decide whether the pair of screenshots can be described by a single caption (A). If possible, an improved caption is authored (B). Participants select one option that better matches the caption (C) and provide their reasons for doing so (D).
  • Figure 4: Model performance on design choice prediction, which involves identifying the preferred UI screenshot from a pair. UIClip models (with bold font) perform the best on held-out human-rated pairs from BetterApp and synthetically-generated pairs from JitterWeb. Most baselines perform poorly, around the level of random chance.
  • Figure 5: Model performance on design suggestion prediction, which involves generating design suggestions for a UI based on detected design flaws. We used the macro-averaged F1 score to measure performance across four CRAP principles. In addition, we introduce a choice-adjusted metric that ignores generated suggestions if they led to the incorrect choice. Using both metrics, UIClip models (with bold font) perform the best on held-out pairs from BetterApp and JitterWeb.
  • ...and 4 more figures
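
The evaluation metrics named in the captions for Figures 4 and 5 can be sketched as follows. This is a minimal interpretation, not the authors' evaluation code: design choice accuracy is taken to be the fraction of pairs where the model picks the human-preferred UI, macro-averaged F1 is averaged over the four CRAP principles (contrast, repetition, alignment, proximity), and the choice-adjusted variant is assumed to discard a pair's predicted suggestions whenever the model chose the wrong UI for that pair. The function names and the convention of scoring a principle with no gold or predicted labels as 0.0 are assumptions.

```python
from typing import List, Sequence, Set

# The four CRAP design principles used as suggestion labels.
PRINCIPLES = ("contrast", "repetition", "alignment", "proximity")


def choice_accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of pairs where the predicted choice matches the gold choice."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal length")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def f1_for_label(label: str, gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Binary F1 for one principle across all examples."""
    tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
    fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
    fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
    denom = 2 * tp + fp + fn
    # Assumed convention: a label that never appears scores 0.0.
    return 2 * tp / denom if denom else 0.0


def macro_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Unweighted mean of per-principle F1 scores."""
    return sum(f1_for_label(l, gold, pred) for l in PRINCIPLES) / len(PRINCIPLES)


def choice_adjusted_macro_f1(
    gold: List[Set[str]], pred: List[Set[str]], choice_correct: Sequence[bool]
) -> float:
    """Macro F1 after dropping suggestions from pairs with a wrong choice."""
    adjusted = [p if ok else set() for p, ok in zip(pred, choice_correct)]
    return macro_f1(gold, adjusted)


if __name__ == "__main__":
    gold = [{"contrast"}, {"alignment"}]
    pred = [{"contrast"}, {"alignment"}]
    print(macro_f1(gold, pred))
    print(choice_adjusted_macro_f1(gold, pred, [True, False]))
```

Under this reading, the choice-adjusted score can never exceed the plain macro F1 when the dropped suggestions were correct, since a model gets no credit for a right suggestion attached to the wrong screenshot.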