UICoder: Finetuning Large Language Models to Generate User Interface Code through Automated Feedback
Jason Wu, Eldon Schoop, Alan Leung, Titus Barik, Jeffrey P. Bigham, Jeffrey Nichols
TL;DR
This work tackles the challenge of reliably generating UI code from natural language descriptions by replacing costly human feedback with automated signals from a code compiler and vision-language models. It introduces UICoder, a self-improving pipeline that starts from an open-source LLM and iteratively self-generates data, filters it with compilation and CLIP-based scoring, and finetunes the model, completing five iterations to produce nearly one million SwiftUI samples. The authors compare three UICoder variants against strong baselines, including proprietary models, and show substantial gains that close the gap to larger models. They also demonstrate the utility of distilling their synthetic data into other models, highlighting practical benefits for open-model deployment. Overall, the approach enables scalable, automated specialization of LLMs for UI code generation, reducing reliance on human annotations and proprietary data while delivering competitive performance.
Abstract
Large language models (LLMs) struggle to consistently generate UI code that compiles and produces visually relevant designs. Existing approaches to improve generation rely on expensive human feedback or on distilling a proprietary model. In this paper, we explore the use of automated feedback (compilers and multi-modal models) to guide LLMs to generate high-quality UI code. Our method starts with an existing LLM and iteratively produces improved models: the original model self-generates a large synthetic dataset, and automated tools then aggressively filter, score, and de-duplicate the data into a refined, higher-quality dataset. The original LLM is improved by finetuning on this refined dataset. We applied our approach to several open-source LLMs and compared the resulting performance to baseline models with both automated metrics and human preferences. Our evaluation shows that the resulting models outperform all other downloadable baselines and approach the performance of larger proprietary models.
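The filter, score, and de-duplicate step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `Sample`, `refine`, `compiles`, `relevance`, and `min_score` are hypothetical, the compiler check and the vision-language relevance score are passed in as plain callables, and the whitespace-normalized exact-match de-duplication is an assumed stand-in for whatever de-duplication the paper actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Sample:
    description: str  # natural-language UI description (the prompt)
    code: str         # UI code generated by the current model


def refine(
    samples: Iterable[Sample],
    compiles: Callable[[str], bool],          # assumed wrapper around a compiler
    relevance: Callable[[str, str], float],   # assumed description/output score
    min_score: float,                         # hypothetical threshold
) -> List[Sample]:
    """Keep samples that compile, score at least min_score, and are not duplicates."""
    kept: List[Sample] = []
    seen = set()
    for s in samples:
        # 1. Filter: discard code the compiler rejects.
        if not compiles(s.code):
            continue
        # 2. Score: discard code whose output does not match the description.
        if relevance(s.description, s.code) < min_score:
            continue
        # 3. De-duplicate: exact match after collapsing whitespace (an assumption).
        key = " ".join(s.code.split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(s)
    return kept
```

In a full loop, the list returned by `refine` would serve as the finetuning set for the next iteration, and the finetuned model would then generate the next round of samples.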
