WebCode2M: A Real-World Dataset for Code Generation from Webpage Designs
Yi Gui, Zhen Li, Yao Wan, Yemin Shi, Hongyu Zhang, Yi Su, Bohua Chen, Dongping Chen, Siyuan Wu, Xing Zhou, Wenbin Jiang, Hai Jin, Xiangliang Zhang
TL;DR
This work addresses the scarcity of real-world training data for generating webpage code from designs. It introduces WebCode2M, a large-scale dataset of 2.56 million real-world instances, each pairing a design image with its HTML/CSS code and layout annotations, built through a Common Crawl–based pipeline with purification, rendering, and neural scoring. A Vision Transformer–based baseline, WebCoder, is trained on WebCode2M and evaluated with a new metric, TreeBLEU, showing clear improvements in both visual fidelity and structural DOM-tree recall over prior datasets and baselines. The paper also benchmarks a wide range of baselines, discusses practical challenges, and releases open-source resources to advance webpage code generation research and tooling.
Abstract
Automatically generating webpage code from webpage designs can significantly reduce the workload of front-end developers, and recent Multimodal Large Language Models (MLLMs) have shown promising potential in this area. However, our investigation reveals that most existing MLLMs are constrained by the absence of high-quality, large-scale, real-world datasets, resulting in inadequate performance in automated webpage code generation. To fill this gap, this paper introduces WebCode2M, a new dataset comprising 2.56 million instances, each containing a design image along with the corresponding webpage code and layout details. Sourced from real-world web resources, WebCode2M offers a rich and valuable dataset for webpage code generation across a variety of applications. Dataset quality is ensured by a scoring model that filters out instances with aesthetic deficiencies or incomplete elements. To validate the effectiveness of WebCode2M, we introduce a baseline model based on the Vision Transformer (ViT), named WebCoder, and establish a benchmark for fair comparison. Additionally, we introduce a new metric, TreeBLEU, to measure recall of the structural hierarchy. The benchmarking results demonstrate that our dataset significantly improves the ability of MLLMs to generate code from webpage designs, confirming its effectiveness and usability for future applications in front-end design tools. Finally, we highlight several practical challenges introduced by our dataset, calling for further research. The code and dataset are publicly available at our project homepage: https://webcode2m.github.io.
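To make the idea of structural hierarchy recall concrete, the following is a minimal sketch of a DOM-tree recall score. It is not the authors' implementation: the abstract does not give the TreeBLEU formula, so this sketch assumes one plausible reading, namely the fraction of height-1 subtrees (a tag together with the ordered tags of its direct children) in the reference page that also appear in the generated page. The names `height1_subtrees` and `tree_recall` are hypothetical.

```python
from html.parser import HTMLParser

# HTML void elements never have children or closing tags.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class _TreeBuilder(HTMLParser):
    """Builds a tag-only tree; each node is [tag, children]."""

    def __init__(self):
        super().__init__()
        self.root = ["#root", []]
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: record a leaf, open nothing.
        self.stack[-1][1].append([tag, []])

    def handle_endtag(self, tag):
        # Close the nearest matching open tag; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


def height1_subtrees(html):
    """Return the set of (tag, child_tags) pairs for every non-leaf node."""
    builder = _TreeBuilder()
    builder.feed(html)
    builder.close()
    subtrees = set()
    todo = list(builder.root[1])  # skip the synthetic root
    while todo:
        tag, children = todo.pop()
        if children:
            subtrees.add((tag, tuple(child[0] for child in children)))
            todo.extend(children)
    return subtrees


def tree_recall(generated_html, reference_html):
    """Assumed metric: share of reference height-1 subtrees found in the output."""
    reference = height1_subtrees(reference_html)
    if not reference:
        return 0.0
    generated = height1_subtrees(generated_html)
    return len(reference & generated) / len(reference)


reference = "<div><h1>T</h1><p>a</p><ul><li>x</li><li>y</li></ul></div>"
generated = "<div><h1>T</h1><p>a</p><ul><li>x</li></ul></div>"
print(tree_recall(generated, reference))
```

In this example the generated page reproduces the outer `div` structure but drops one list item, so only one of the two reference subtrees is matched. Because the score is a recall, extra structure in the generated page is not penalized, which is why such a metric is typically reported alongside visual-fidelity measures.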
