Generating Automatic Feedback on UI Mockups with Large Language Models

Peitong Duan, Jeremy Warner, Yang Li, Bjoern Hartmann

TL;DR

This work applies GPT-4 to automate heuristic evaluation, which currently entails a human expert assessing a UI’s compliance with a set of design guidelines, and implements a Figma plugin that renders the automatically generated feedback as constructive suggestions.

Abstract

Feedback on user interface (UI) mockups is crucial in design. However, human feedback is not always readily available. We explore the potential of using large language models for automatic feedback. Specifically, we focus on applying GPT-4 to automate heuristic evaluation, which currently entails a human expert assessing a UI's compliance with a set of design guidelines. We implemented a Figma plugin that takes in a UI design and a set of written heuristics, and renders automatically-generated feedback as constructive suggestions. We assessed performance on 51 UIs using three sets of guidelines, compared GPT-4-generated design suggestions with those from human experts, and conducted a study with 12 expert designers to understand fit with existing practice. We found that GPT-4-based feedback is useful for catching subtle errors, improving text, and considering UI semantics, but feedback also decreased in utility over iterations. Participants described several uses for this plugin despite its imperfect suggestions.

Paper Structure (55 sections, 11 figures, 1 table)

Figures (11)

  • Figure 1: Illustration of plugin interactions that contextualize text feedback with the UI. "A" shows that clicking on a link in the violation text selects the corresponding group or element in the Figma mockup and Layers panel. "B" shows the "click to focus" feature, where clicking on a violation fades the other violations and draws a box around the corresponding group in the UI screenshot. "C" illustrates that hovering over a group or element link draws a blue box around the corresponding element in the screenshot. "D" points out that clicking on the 'X' icon of a violation hides it and adds this feedback to the LLM prompt for the next round of evaluation.
  • Figure 2: Our LLM-based plugin system architecture. The designer prototypes a UI in Figma (Box 1), and the plugin generates a UI representation to send to an LLM (3). The designer also selects heuristics/guidelines to use for evaluating the prototype (2), and a prompt containing the UI representation (in JSON) and guidelines is created and sent to the LLM (4). After identifying all the guideline violations, another LLM query is made to rephrase the guideline violations into constructive design advice (4). The LLM response is then programmatically parsed (5), and the plugin produces an interpretable representation of the response to display (6). The designer dismisses incorrect suggestions, which are incorporated in the LLM prompt for the next round of evaluation, if there is room in the context window (7).
  • Figure 3: An example portion of a UI JSON. It has a tree structure, where each node has a list of child nodes (the "children" field). Each node in this JSON is color-coded with its corresponding group or element in the UI screenshot. The node named "lyft event photo and logo" is a group ("type: GROUP") consisting of a photo of the live chat event ("lyft live chat event photo") and the Lyft logo ("lyft logo"). The JSON node for the photo contains its location information ("bounds"), type ("IMAGE"), and unique identifier ("id"). The JSON node for "lyft logo" contains its location and some stylistic information, like the stroke color and stroke weight for its white border.
  • Figure 4: An illustration of the formats of the three studies. The Performance Study consists of 3 raters evaluating the accuracy and helpfulness of GPT-4-generated suggestions for 51 UI mockups. The Heuristic Evaluation Study with Human Experts consists of 12 design experts, who each looked for guideline violations in 6 UIs, and finishes with an interview asking them to compare their violations with those found by the LLM. Finally, the Iterative Usage Study comprises another group of 12 design experts, each working with 3 UI mockups. For each mockup, the expert iteratively revises the design based on the LLM's valid suggestions and rates the LLM's feedback, going through 2-3 rounds of this per UI. The Usage Study concludes with an interview about the expert's experience with the tool.
  • Figure 5: Histogram showing the number of ratings in each category for accuracy and helpfulness, from the 3 participants in the Performance Study. For accuracy, the scale is: "1 - not accurate", "2 - partially accurate", and "3 - accurate". The scale for helpfulness ranges from "1 - not at all helpful" to "5 - very helpful". The rating data is also visualized as horizontal bar charts for this study and the Usage Study.
  • ...and 6 more figures
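
The captions for Figures 2 and 3 describe a tree-structured UI JSON and a prompt that combines it with the selected guidelines and any dismissed suggestions. The following is a minimal Python sketch of that idea, not the authors' implementation. The field names "id", "type", "bounds", and "children" and the node names come from the Figure 3 caption; the id values, coordinates, style keys ("strokeColor", "strokeWeight"), the logo's node type, the helper names `index_nodes` and `build_prompt`, and the prompt wording are all assumptions made for illustration.

```python
import json

# Hypothetical UI JSON shaped like the portion described for Figure 3.
# Ids, bounds, and style values are invented for illustration only.
ui_json = {
    "id": "1:2",
    "name": "lyft event photo and logo",
    "type": "GROUP",
    "children": [
        {
            "id": "1:3",
            "name": "lyft live chat event photo",
            "type": "IMAGE",
            "bounds": {"x": 0, "y": 0, "width": 375, "height": 200},
            "children": [],
        },
        {
            "id": "1:4",
            "name": "lyft logo",
            "type": "IMAGE",  # assumed; the caption does not state the logo's type
            "bounds": {"x": 16, "y": 150, "width": 40, "height": 40},
            "strokeColor": "#FFFFFF",  # assumed key name for the white border
            "strokeWeight": 2,         # assumed key name and value
            "children": [],
        },
    ],
}


def index_nodes(node, index=None):
    """Walk the tree depth-first and map each node id to its node.

    A lookup like this would let links in the feedback text resolve
    to the group or element they mention.
    """
    if index is None:
        index = {}
    index[node["id"]] = node
    for child in node.get("children", []):
        index_nodes(child, index)
    return index


def build_prompt(ui, guidelines, dismissed=()):
    """Assemble one prompt string from guidelines, dismissed feedback, and the UI JSON.

    The wording here is a placeholder, not the prompt used in the paper.
    """
    parts = ["Guidelines:"]
    parts += [f"- {g}" for g in guidelines]
    if dismissed:
        parts.append("Previously dismissed feedback (do not repeat):")
        parts += [f"- {d}" for d in dismissed]
    parts.append("UI JSON:")
    parts.append(json.dumps(ui, indent=2))
    return "\n".join(parts)


if __name__ == "__main__":
    nodes = index_nodes(ui_json)
    print(sorted(nodes))
    print(build_prompt(ui_json, ["Use consistent spacing"], ["Logo is too small"]))
```

Keeping dismissed suggestions as a separate prompt section mirrors step (7) in the Figure 2 caption, where dismissed feedback is carried into the next round only if it fits in the context window; a fuller version would need to check the prompt length before appending that section.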