WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs

Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Bohua Chen, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, Xiangliang Zhang

TL;DR

This work tackles the challenge of generating webpage code from designs by addressing the scarcity of real-world training data. It introduces WebCode2M, a large-scale dataset of 2.56 million real-world instances, each combining a design image, the corresponding HTML/CSS code, and layout annotations, built through a Common Crawl–based pipeline with purification, rendering, and neural scoring. A Vision Transformer–based baseline, WebCoder, is trained on WebCode2M and evaluated with a new TreeBLEU metric, showing improvements in both visual fidelity and structural DOM-tree recall over prior datasets and baselines. The paper also benchmarks against a wide range of baselines, discusses practical challenges, and releases open-source resources to advance webpage code generation research and tooling.

Abstract

Automatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent Multimodal Large Language Models (MLLMs) have shown promising potential in this area. However, our investigation reveals that most existing MLLMs are constrained by the absence of high-quality, large-scale, real-world datasets, resulting in inadequate performance in automated webpage code generation. To fill this gap, this paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. Sourced from real-world web resources, WebCode2M offers a rich and valuable dataset for webpage code generation across a variety of applications. The dataset quality is ensured by a scoring model that filters out instances with aesthetic deficiencies or other incomplete elements. To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. Additionally, we introduce a new metric, TreeBLEU, to measure the structural hierarchy recall. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs, confirming its effectiveness and usability for future applications in front-end design tools. Finally, we highlight several practical challenges introduced by our dataset, calling for further research. The code and dataset are publicly available at our project homepage: https://webcode2m.github.io.
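
The abstract describes TreeBLEU only as a measure of structural hierarchy recall and does not give its formula here. As a rough illustration of that idea, the sketch below assumes one plausible reading: the fraction of distinct height-1 subtrees (a parent tag together with its ordered child tags) in the reference DOM that also appear in the generated DOM. The helper names `height1_subtrees` and `subtree_recall` are hypothetical, and the paper's exact definition may differ.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have children and need no closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TreeBuilder(HTMLParser):
    """Builds a bare tag tree of (tag, children) nodes; text and attributes are ignored."""

    def __init__(self):
        super().__init__()
        self.root = ("#root", [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, [])
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing element: attach it, but do not descend into it.
        self.stack[-1][1].append((tag, []))

    def handle_endtag(self, tag):
        # Close the nearest matching open element; ignore stray closing tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def height1_subtrees(html):
    """Count (parent_tag, child_tags) pairs for every element that has child elements."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    subtrees = Counter()
    pending = list(builder.root[1])
    while pending:
        tag, children = pending.pop()
        if children:
            subtrees[(tag, tuple(child[0] for child in children))] += 1
        pending.extend(children)
    return subtrees


def subtree_recall(candidate_html, reference_html):
    """Share of distinct reference subtrees that also occur in the candidate (assumed metric)."""
    reference = height1_subtrees(reference_html)
    candidate = height1_subtrees(candidate_html)
    if not reference:
        return 0.0
    matched = sum(1 for subtree in reference if subtree in candidate)
    return matched / len(reference)
```

For example, if the reference page contains a `ul` list and the generated page renders the same content with `ol`, both the list subtree and its parent's subtree stop matching, which lowers the score even when the two pages look similar on screen.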

Paper Structure

This paper contains 21 sections, 1 equation, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The pipeline of constructing the WebCode2M dataset.
  • Figure 2: Representative screenshots of webpages in WebCode2M and other datasets.
  • Figure 3: Score distributions of annotators in two groups.
  • Figure 4: Score distribution of the manually annotated subset (inner ring) and the entire dataset (outer ring) before score-based filtering.
  • Figure 5: Length density of the WebCode2M dataset.
  • ...and 3 more figures