pix2code: Generating Code from a Graphical User Interface Screenshot

Tony Beltramelli

TL;DR

pix2code introduces an end-to-end CNN-LSTM framework that generates DSL tokens representing GUI code from a single GUI screenshot. It combines a vision encoder, a token-based language model, and a decoder to produce platform-agnostic code, demonstrated on synthesized iOS, Android, and web datasets. The release of the synthetic GUI-code datasets and a public implementation underscores its potential, while the work also highlights limitations such as reliance on one-hot DSL tokens and the need for attention or embeddings to improve accuracy. The findings suggest promising directions: scaling up with more data, richer architectures, and alternative data sources such as web crawls to automate GUI implementation across platforms.

Abstract

Transforming a graphical user interface screenshot created by a designer into computer code is a typical task conducted by a developer in order to build customized software, websites, and mobile applications. In this paper, we show that deep learning methods can be leveraged to train a model end-to-end to automatically generate code from a single input image with over 77% accuracy for three different platforms (i.e. iOS, Android and web-based technologies).

Paper Structure

This paper contains 10 sections, 3 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of the pix2code model architecture. During training, the GUI image is encoded by a CNN-based vision model; the context (i.e. a sequence of one-hot encoded tokens corresponding to DSL code) is encoded by a language model consisting of a stack of LSTM layers. The two resulting feature vectors are then concatenated and fed into a second stack of LSTM layers acting as a decoder. Finally, a softmax layer is used to sample one token at a time; the output size of the softmax layer corresponds to the DSL vocabulary size. Given an image and a sequence of tokens, the model (i.e. contained in the gray box) is differentiable and can thus be optimized end-to-end through gradient descent to predict the next token in the sequence. During sampling, the input context is updated for each prediction to contain the last predicted token. The resulting sequence of DSL tokens is compiled to the desired target language using traditional compiler design techniques.
  • Figure 2: An example of a native iOS GUI written in our markup-like DSL.
  • Figure 3: Training loss on different datasets and ROC curves calculated during sampling with the model trained for 10 epochs.
  • Figure 4: Experiment samples for the iOS GUI dataset.
  • Figure 5: Experiment samples from the web-based GUI dataset.
  • ...and 1 more figure
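
The sampling procedure described in the Figure 1 caption (predict one token, append it to the context, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the special tokens `<START>`, `<END>`, and `<PAD>`, the function names, the `context_length` and `max_tokens` defaults, and the use of greedy argmax selection instead of sampling from the softmax distribution are all assumptions made for this example. The trained CNN-LSTM model is replaced by a hypothetical `predict_next` callable that returns a probability for each DSL token.

```python
from typing import Callable, Dict, List

# Hypothetical special tokens; the source does not name them.
START, END, PAD = "<START>", "<END>", "<PAD>"


def sample_dsl(
    image: object,
    predict_next: Callable[[object, List[str]], Dict[str, float]],
    context_length: int = 8,   # assumed window size, for illustration only
    max_tokens: int = 100,     # assumed safety limit on output length
) -> List[str]:
    """Generate DSL tokens one at a time for a given GUI image."""
    # Start with a padded context that ends in the start token.
    context = [PAD] * (context_length - 1) + [START]
    output: List[str] = []
    for _ in range(max_tokens):
        # The model maps (image, context) to a distribution over the vocabulary.
        probs = predict_next(image, context)
        # Greedy choice (an assumption): take the most probable token.
        token = max(probs, key=probs.get)
        if token == END:
            break
        output.append(token)
        # Update the context so it contains the last predicted token.
        context = context[1:] + [token]
    return output


def toy_predict_next(image: object, context: List[str]) -> Dict[str, float]:
    """Stand-in for a trained model: follows a fixed script, ignores the image."""
    script = {START: "stack", "stack": "{", "{": "btn", "btn": "}", "}": END}
    return {script[context[-1]]: 1.0}


if __name__ == "__main__":
    print(sample_dsl(None, toy_predict_next))
```

In a real setting, `predict_next` would wrap the trained network's forward pass, and the returned token list would then be handed to a compiler that emits the target platform's code.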