pix2code: Generating Code from a Graphical User Interface Screenshot
Tony Beltramelli
TL;DR
pix2code introduces an end-to-end CNN-LSTM framework that generates DSL tokens representing GUI code from a single GUI screenshot. It combines a vision encoder, a token-based language model, and a decoder to produce platform-agnostic code, demonstrated on synthesized iOS, Android, and web datasets. The release of the synthetic GUI-code datasets and a public implementation underscores the work's potential, while the paper also acknowledges limitations such as its reliance on one-hot DSL tokens and the need for attention or embeddings to improve accuracy. The findings suggest promising directions: scaling up with more data, using richer architectures, and drawing on alternative data sources such as web crawls to automate GUI implementation across platforms.
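To make the "one-hot DSL tokens" limitation concrete, here is a minimal sketch of how a DSL token sequence might be one-hot encoded and recovered by an argmax. The vocabulary, the token names, and the helper functions are hypothetical illustrations, not the paper's actual DSL or implementation.

```python
# Hypothetical DSL vocabulary (assumption): the real token set is not given here.
VOCAB = ["<START>", "<END>", "row", "btn", "label", "{", "}"]
TOKEN_TO_INDEX = {token: index for index, token in enumerate(VOCAB)}


def one_hot(token):
    """Return a vector with a single 1 at the token's vocabulary index."""
    vector = [0] * len(VOCAB)
    vector[TOKEN_TO_INDEX[token]] = 1
    return vector


def encode(tokens):
    """One-hot encode a whole token sequence."""
    return [one_hot(token) for token in tokens]


def decode(vectors):
    """Map each vector back to a token by taking the index of its largest entry."""
    return [VOCAB[max(range(len(v)), key=v.__getitem__)] for v in vectors]


# Illustrative sequence (assumption): a row containing a button and a label.
sequence = ["<START>", "row", "{", "btn", "label", "}", "<END>"]
encoded = encode(sequence)
```

Because every vector is as long as the vocabulary and carries no notion of similarity between tokens, this representation grows with the token set, which is one reason learned embeddings are suggested as an improvement.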
Abstract
Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites, and mobile applications. In this paper, we show that deep learning methods can be leveraged to train a model end-to-end to automatically generate code from a single input image with over 77% accuracy for three different platforms (i.e., iOS, Android, and web-based technologies).
