Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs

Sukmin Yun, Haokun Lin, Rusiru Thushara, Mohammad Qazim Bhat, Yongxin Wang, Zutao Jiang, Mingkai Deng, Jinhong Wang, Tianhua Tao, Junbo Li, Haonan Li, Preslav Nakov, Timothy Baldwin, Zhengzhong Liu, Eric P. Xing, Xiaodan Liang, Zhiqiang Shen

TL;DR

Web2Code addresses the gap in multimodal LLMs' ability to understand webpage screenshots and translate them into HTML by introducing a large-scale, instruction-tuning webpage dataset and a two-part evaluation framework. The dataset combines newly generated image-code pairs, refined existing code data, and enriched webpage-understanding QA data, totaling around 1.18 million instruction entries with both synthetic and refined content. The evaluation framework (WUB and WCGB) assesses offline webpage understanding and online HTML-code generation fidelity by rendering outputs and using GPT-4V for scoring. Empirical results show that incorporating Web2Code data improves both webpage understanding and HTML generation across multiple backbones, while preserving general-domain capabilities, suggesting strong practical impact for web automation and UI prototyping.

Abstract

Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose $\texttt{Web2Code}$, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code are available at https://github.com/MBZUAI-LLM/web2code.
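The abstract describes two kinds of instruction entries: webpage image plus instruction mapped to HTML code, and webpage image plus question mapped to a natural-language answer. As a rough illustration only, such entries might be represented as below. The class name `WebpageInstructionEntry`, its field names, the `task` labels, the file paths, and the JSON Lines serialization are all assumptions made for this sketch, not the released dataset schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class WebpageInstructionEntry:
    """Hypothetical record layout; field names are illustrative assumptions."""
    image_path: str   # rendered webpage screenshot (input)
    instruction: str  # prompt or question about the screenshot (input)
    response: str     # HTML code or a natural-language answer (target)
    task: str         # assumed label: "code" or "qa"


def to_jsonl(entries):
    """Serialize entries as JSON Lines, one JSON object per line."""
    return "\n".join(json.dumps(asdict(entry)) for entry in entries)


# Two made-up entries, one per task type described in the abstract.
examples = [
    WebpageInstructionEntry(
        image_path="screenshots/page_0001.png",
        instruction="Generate the HTML code for this webpage.",
        response="<html><body><h1>Hello</h1></body></html>",
        task="code",
    ),
    WebpageInstructionEntry(
        image_path="screenshots/page_0001.png",
        instruction="What is the main heading on this page?",
        response="The main heading reads 'Hello'.",
        task="qa",
    ),
]

jsonl_text = to_jsonl(examples)
```

Keeping both task types in one record shape lets a single data loader serve code-generation and webpage-understanding examples, which is one plausible way to mix them during instruction tuning.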

Paper Structure (23 sections, 22 figures, 12 tables)

This paper contains 23 sections, 22 figures, 12 tables.

Figures (22)

  • Figure 1: Our motivation for constructing the Web2Code dataset stems from the limitations of previous models, such as LLaVA [liu2023improved], which are trained on general datasets and struggle to generate high-quality webpages, as shown in the second row. Our dataset aims to significantly enhance the quality of webpage generation, as shown in the third row, while maintaining a strong level of general multimodal ability.
  • Figure 2: Qualitative example from the generated question-answer pair dataset. Questions cover diverse aspects of webpage understanding.
  • Figure 3: WebSRC data refinement for improved quality. Left: before refinement. Right: after refinement, the quality has improved and duplicates have been removed.
  • Figure 4: Word cloud for the answer set of the GPT-4-based DWU dataset.
  • Figure 5: Distribution of the 20 most common tags in the GPT-3.5-based HTML data.
  • ...and 17 more figures
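
A tag distribution like the one summarized in Figure 5 can be computed by counting opening tags across a collection of HTML documents. The following is a minimal sketch using only the Python standard library; the helper names `TagCounter` and `most_common_tags` and the sample documents are assumptions for illustration, and this is not the authors' analysis code.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts every opening (and self-closing) tag seen while parsing."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        self.counts[tag] += 1


def most_common_tags(html_documents, k=20):
    """Return the k most frequent tags over all documents as (tag, count) pairs."""
    total = Counter()
    for document in html_documents:
        parser = TagCounter()  # fresh parser so state never leaks across documents
        parser.feed(document)
        parser.close()
        total.update(parser.counts)
    return total.most_common(k)


# Made-up sample documents for demonstration.
sample_documents = [
    "<html><body><div><p>a</p><p>b</p></div></body></html>",
    "<div><img src='x.png'/></div>",
]
top_tags = most_common_tags(sample_documents, k=2)
```

Using one parser per document keeps a malformed page from affecting the counts of the pages that follow it.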