Improving User Interface Generation Models from Designer Feedback

Jason Wu, Amanda Swearngin, Arun Krishna Vajjala, Alan Leung, Jeffrey Nichols, Titus Barik

TL;DR

This paper investigates several approaches for designers to give feedback to UI generation models, using familiar interactions such as commenting, sketching and direct manipulation, and finds that designer-aligned approaches outperform models trained with traditional ranking feedback and all tested baselines, including GPT-5.

Abstract

Despite being trained on vast amounts of data, most LLMs are unable to reliably generate well-designed UIs. Designer feedback is essential to improving performance on UI generation; however, we find that existing RLHF methods based on ratings or rankings are not well aligned with designers' workflows and ignore the rich rationale used to critique and improve UI designs. In this paper, we investigate several approaches for designers to give feedback to UI generation models, using familiar interactions such as commenting, sketching, and direct manipulation. We first perform an evaluation in which 21 designers gave feedback using these interactions, resulting in 1500 design annotations. We then use this data to fine-tune a series of LLMs to generate higher-quality UIs. Finally, we evaluate these models with human judges and find that our designer-aligned approaches outperform models trained with traditional ranking feedback and all tested baselines, including GPT-5.

Paper Structure

This paper contains 41 sections, 3 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The four interfaces we developed to collect feedback from designers. The ranking interface (Far Left) allows users to select the better of two UI screenshots through a binary response. The commenting interface (Center Left) allows users to write a list of natural language critiques or comments for a UI screenshot. The sketch interface (Center Right) allows users to draw annotations (boxes and points) on a UI screenshot and associate them with textual comments. Designers used the Sketch design software (Far Right) to make direct edits to model-generated UIs, which were first converted into the appropriate format. The commenting, sketching, and revising interfaces are inspired by interactions identified by Hartmann et al. [hartmann2010d].
  • Figure 2: Rating scores (Top) and average win rate (Bottom) of models in our feedback fine-tuning evaluation. We computed the rating scores using the LMSYS calculation methodology [chiang2024chatbot, zheng2023chatbotarena]; higher scores indicate models that were more often preferred by human judges. Bars show the median score and 95% confidence intervals generated using bootstrap sampling.
  • Figure 3: Rating scores (Top) and average win rate (Bottom) of models in our model generalization evaluation. We computed the rating scores using the LMSYS calculation methodology [chiang2024chatbot, zheng2023chatbotarena]; higher scores indicate models that were more often preferred by human judges. Bars show the median score and 95% confidence intervals generated using bootstrap sampling.
  • Figure 4: Rendered output of six models tested in the feedback fine-tuning evaluation. We rendered model outputs for five randomly sampled text descriptions from our evaluation set.
  • Figure 5: Rendered output of six models tested in the model generalization evaluation. We rendered model outputs for five randomly sampled text descriptions from our evaluation set.
  • ...and 1 more figures
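
The captions for Figures 2 and 3 describe rating scores computed from pairwise human judgments, with medians and 95% confidence intervals from bootstrap sampling. The sketch below illustrates one way such scores could be computed. It is not the authors' code: it assumes an online Elo update with parameters commonly seen in Chatbot Arena style analyses (K = 4, scale 400, initial rating 1000), and the function names `elo_ratings` and `bootstrap_ratings`, the battle tuple format, and the percentile indexing are all hypothetical choices made for illustration.

```python
import random
from statistics import median


def elo_ratings(battles, k=4.0, scale=400.0, base=10.0, init=1000.0):
    """Online Elo over (model_a, model_b, winner) tuples.

    winner is "a", "b", or "tie". All parameter defaults are assumptions.
    """
    ratings = {}
    for a, b, winner in battles:
        ra = ratings.setdefault(a, init)
        rb = ratings.setdefault(b, init)
        # Expected score of model a given the current rating gap.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        # Zero-sum update: whatever a gains, b loses.
        delta = k * (score_a - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings


def bootstrap_ratings(battles, rounds=200, seed=0):
    """Resample battles with replacement; return (low, median, high) per model."""
    rng = random.Random(seed)
    samples = {}
    for _ in range(rounds):
        resampled = [rng.choice(battles) for _ in battles]
        for model, rating in elo_ratings(resampled).items():
            samples.setdefault(model, []).append(rating)
    summary = {}
    for model, values in samples.items():
        values.sort()
        # Approximate 2.5th and 97.5th percentiles of the bootstrap samples.
        low = values[int(0.025 * (len(values) - 1))]
        high = values[int(0.975 * (len(values) - 1))]
        summary[model] = (low, median(values), high)
    return summary
```

Calling `bootstrap_ratings(battles)` on a list of judged comparisons returns, for each model, a triple that corresponds to the lower interval bound, the median score, and the upper interval bound shown as bars in such a plot. Because online Elo depends on the order of comparisons, resampling also shuffles that order, which is part of why the median over bootstrap rounds is reported rather than a single pass.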