UICrit: Enhancing Automated Design Evaluation with a UICritique Dataset

Peitong Duan, Chin-yi Chen, Gang Li, Bjoern Hartmann, Yang Li

TL;DR

The paper addresses the gap between automated UI evaluation and human judgment by introducing UICrit, a targeted dataset of 3,059 expert design critiques and quality ratings for 983 mobile UI screens, enriched with bounding boxes of the critiqued screen regions and design-quality rubrics. Using this dataset for few-shot and visual prompting improves LLM-generated UI feedback by about 55% over zero-shot baselines, a result validated through a user study with design experts. The authors demonstrate the dataset's utility for generating feedback on both regions of interest and full screens, and discuss broader applications such as tool-agnostic UI evaluation and reward signals for UI generators. Proposed future directions include fine-tuning multi-modal LLMs and integrating the approach with design tools.

Abstract

Automated UI evaluation can be beneficial for the design process; for example, to compare different UI designs, or conduct automated heuristic evaluation. LLM-based UI evaluation, in particular, holds the promise of generalizability to a wide variety of UI types and evaluation tasks. However, current LLM-based techniques do not yet match the performance of human evaluators. We hypothesize that automatic evaluation can be improved by collecting a targeted UI feedback dataset and then using this dataset to enhance the performance of general-purpose LLMs. We present a targeted dataset of 3,059 design critiques and quality ratings for 983 mobile UIs, collected from seven experienced designers. We carried out an in-depth analysis to characterize the dataset's features. We then applied this dataset to achieve a 55% performance gain in LLM-generated UI feedback via various few-shot and visual prompting techniques. We also discuss future applications of this dataset, including training a reward model for generative UI techniques, and fine-tuning a tool-agnostic multi-modal LLM that automates UI evaluation.

Paper Structure

This paper contains 55 sections, 11 figures, and 7 tables.

Figures (11)

  • Figure 1: An illustration of the data collection process to obtain design comments, bounding boxes of critiqued screen regions, and design quality scores for 1000 UI screens.
  • Figure 2: The interface for design critique and rating annotation, with regions corresponding to each part of the annotation marked.
  • Figure 3: Examples of the data provided for each UI in the dataset. This figure shows worker annotations for a highly-rated UI, an average-rated UI, and a low-rated UI.
  • Figure 4: Histograms showing the counts for each numerical rating for aesthetics, usability, and overall design quality. The ratings all generally follow a normal distribution.
  • Figure 5: The results of K-means clustering by critique topic (left) and UI task category (right). Each cluster is labeled with its corresponding topic, which was determined through qualitative analysis.
  • ...and 6 more figures