DOne: Decoupling Structure and Rendering for High-Fidelity Design-to-Code Generation

Xinhao Huang, Jinke Yu, Wenhao Xu, Zeyi Wen, Ying Zhou, Junzhuo Liu, Junhao Ji, Zulong Chen

Abstract

While Vision Language Models (VLMs) have shown promise in Design-to-Code generation, they suffer from a "holistic bottleneck": they fail to reconcile high-level structural hierarchy with fine-grained visual details, often resulting in layout distortions or generic placeholders. To bridge this gap, we propose DOne, an end-to-end framework that decouples structure understanding from element rendering. DOne introduces (1) a learned layout segmentation module that decomposes complex designs, avoiding the limitations of heuristic cropping; (2) a specialized hybrid element retriever that handles the extreme aspect ratios and densities of UI components; and (3) a schema-guided generation paradigm that bridges layout and code. To rigorously assess performance, we introduce HiFi2Code, a benchmark featuring significantly higher layout complexity than existing datasets. Extensive evaluations on HiFi2Code demonstrate that DOne outperforms existing methods in both high-level visual similarity (e.g., by over 10% in GPT Score) and fine-grained element alignment. Human evaluations confirm a threefold productivity gain with higher visual fidelity.
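The abstract describes a three-stage, decoupled pipeline: layout segmentation, element retrieval, and schema-guided code generation. The sketch below illustrates how such a pipeline could be wired together; the function names, the Region/LayoutSchema types, and all signatures are assumptions made for exposition only, not DOne's actual interfaces.

```python
# Minimal sketch of a decoupled design-to-code pipeline, assuming hypothetical
# interfaces. Nothing here is taken from the paper's implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Region:
    """A layout region produced by the segmentation stage."""
    bbox: Tuple[int, int, int, int]              # (x, y, width, height)
    children: List["Region"] = field(default_factory=list)


@dataclass
class LayoutSchema:
    """Intermediate schema bridging layout structure and code generation."""
    regions: List[Region]
    elements: Dict[int, List[str]]               # region index -> retrieved component ids


def segment_layout(design_image: bytes) -> List[Region]:
    """Stage 1 (assumed): learned layout segmentation rather than heuristic cropping."""
    return []  # placeholder: a real module would return a region hierarchy


def retrieve_elements(design_image: bytes, region: Region) -> List[str]:
    """Stage 2 (assumed): hybrid retrieval of fine-grained UI components per region."""
    return []  # placeholder: a real retriever would handle extreme aspect ratios


def generate_code(schema: LayoutSchema) -> str:
    """Stage 3 (assumed): schema-guided code generation."""
    return "<div><!-- generated markup --></div>"


def design_to_code(design_image: bytes) -> str:
    regions = segment_layout(design_image)                     # structure
    elements = {i: retrieve_elements(design_image, r)          # rendering details
                for i, r in enumerate(regions)}
    return generate_code(LayoutSchema(regions, elements))      # structure + details -> code
```

The point of the sketch is the separation of concerns: structure (regions) and rendering (retrieved elements) are produced independently and only combined in the schema passed to the generator.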

Paper Structure

This paper contains 49 sections, 4 equations, 10 figures, 9 tables, and 2 algorithms.

Figures (10)

  • Figure 1: Reconstruction comparison (a-b) and illustrative examples of baseline failures (c-d).
  • Figure 2: The DOne Framework. DOne employs layout segmentation and visual element detection, followed by schema-guided code generation to ensure structural and visual fidelity.
  • Figure 3: Example of layout segmentation.
  • Figure 4: Example of visual element detection.
  • Figure 5: Pairwise win/tie/loss rates vs. S2C.
  • ...and 5 more figures