UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback

Jason Wu, Eldon Schoop, Alan Leung, Titus Barik, Jeffrey P. Bigham, Jeffrey Nichols

TL;DR

This work tackles the challenge of reliably generating UI code from natural language descriptions by replacing costly human feedback with automated signals from a code compiler and vision-language models. It introduces UICoder, a self-improving pipeline that starts from an open-source LLM and iteratively self-generates data, filters it with compilation and CLIP-based scoring, and finetunes the model, completing five iterations to produce nearly one million SwiftUI samples. The authors compare three UICoder variants against strong baselines, including proprietary models, and show substantial gains that close the gap to larger models. They also demonstrate the utility of distilling their synthetic data into other models, highlighting practical benefits for open-model deployment. Overall, the approach enables scalable, automated specialization of LLMs for UI code generation, reducing reliance on human annotations and proprietary data while delivering competitive performance.

Abstract

Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models by self-generating a large synthetic dataset using the original model, then applying automated tools to aggressively filter, score, and de-duplicate the data into a refined, higher-quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.
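
The generate, filter, score, de-duplicate, and finetune loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Sample`, `refine_dataset`, `self_improve`, and every callable passed in (`compiles`, `relevance`, `generate`, `finetune`, `refine`) are hypothetical names. The compiler check and the CLIP-style relevance score are left as injected functions, the threshold `min_relevance` is an arbitrary parameter, and de-duplication is reduced to whitespace-normalized exact matching as a stand-in for whatever the paper actually uses.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    description: str  # natural-language UI description (the prompt)
    code: str         # UI code the model generated for that description


def refine_dataset(
    samples: Iterable[Sample],
    compiles: Callable[[str], bool],         # assumed wrapper around a compiler
    relevance: Callable[[str, str], float],  # assumed CLIP-style score
    min_relevance: float,                    # hypothetical cutoff
) -> List[Sample]:
    """Keep samples that compile, score above the cutoff, and are not repeats."""
    kept: List[Sample] = []
    seen = set()
    for sample in samples:
        if not compiles(sample.code):
            continue  # automated feedback 1: discard code that does not build
        if relevance(sample.description, sample.code) < min_relevance:
            continue  # automated feedback 2: discard visually irrelevant output
        key = " ".join(sample.code.split())  # stand-in de-duplication key
        if key in seen:
            continue
        seen.add(key)
        kept.append(sample)
    return kept


def self_improve(
    model: Any,
    prompts: List[str],
    generate: Callable[[Any, str], str],
    finetune: Callable[[Any, List[Sample]], Any],
    refine: Callable[[List[Sample]], List[Sample]],
    iterations: int,
) -> Any:
    """Repeat generate -> refine -> finetune; each round uses the latest model."""
    for _ in range(iterations):
        raw = [Sample(prompt, generate(model, prompt)) for prompt in prompts]
        model = finetune(model, refine(raw))
    return model
```

Passing the feedback signals in as plain callables keeps the loop independent of any particular compiler or vision-language model, so either can be swapped without touching the filtering logic. Whether each round finetunes the latest model or restarts from the original one is an assumption here; the sketch does the former.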

Paper Structure (21 sections, 6 figures, 3 tables)

Figures (6)

  • Figure 1: A flow chart showing an overview of the multi-step training process, including a base model, a supervised-tuned model, and an aligned model.
  • Figure 2: A plot of two automatically calculated metrics over time (on a held-out set): compilation rate and mean CLIP score. Over the course of training, our model improves metrics used to filter its training data.
  • Figure 3: A matrix showing the predicted win probability of model A against model B. Our training technique significantly improved an initially poorly performing base model (StarChat), producing a model (UICoder) that is competitive with larger proprietary models.
  • Figure 4: Screenshots rendered from SwiftUI code generated by our models. For illustration purposes we manually included stock photos and icons. The model-generated code was not modified in any way except to update image asset names.
  • Figure 5: We demonstrate limitations of our approach through four types of failure cases observed in generated data. Note that all icons and images in these samples were replaced with placeholders.
  • ...and 1 more figure