Multimodal graph representation learning for website generation based on visual sketch

Tung D. Vu, Chung Hoang, Truong-Son Hy

TL;DR

This work tackles Design2Code by introducing a graph-enhanced multimodal framework that fuses visual, textual, and structural cues to generate HTML from UI designs. Key components include OCR-based text extraction, SAM-driven non-text component segmentation, a multimodal graph encoding textual and visual nodes, and a vision-language model that uses graph and vision conditioning through cross-attention. The approach leverages a Graph Convolutional Network with CLIP-informed node features, a Perceiver-based vision encoder, and Gated Cross Attention to produce content-aware HTML with improved layout fidelity. Extensive experiments on WebSight and Design2Code benchmarks demonstrate superior content accuracy and structural/layout alignment compared with baselines, underscoring the method’s potential to advance automated design-to-code workflows.
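The graph-encoding step mentioned above can be illustrated with a single graph-convolution layer. The following is a minimal sketch, not the authors' implementation: it uses the standard symmetric-normalized GCN propagation rule, random vectors stand in for the CLIP-informed node features, and the four-node adjacency matrix is a hypothetical page graph (how edges are actually built is not stated in this summary).

```python
import numpy as np

def normalized_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2}, the standard GCN normalization."""
    a_hat = adj + np.eye(adj.shape[0])      # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(adj_norm, h, w):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    return np.maximum(adj_norm @ h @ w, 0.0)

# Hypothetical page graph (assumption): nodes 0-1 are OCR text regions,
# nodes 2-3 are non-text segments; edges link neighbouring regions.
adj = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

rng = np.random.default_rng(0)
features = rng.normal(size=(4, 6))   # stand-in for CLIP-informed node features
weights = rng.normal(size=(6, 3))    # layer weights (learned in practice)

node_embeddings = gcn_layer(normalized_adjacency(adj), features, weights)
```

Each row of `node_embeddings` mixes a node's own features with those of its neighbours, which is the property that lets structural relationships between text and non-text components reach the language model.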

Abstract

The Design2Code problem, which involves converting digital designs into functional source code, is a significant challenge in software development due to its complexity and time-consuming nature. Traditional approaches often struggle to interpret the intricate visual details and structural relationships inherent in webpage designs, which limits automation and efficiency. In this paper, we propose a novel method that leverages multimodal graph representation learning to address these challenges. By integrating both visual and structural information from design sketches, our approach enhances the accuracy and efficiency of code generation, particularly in producing semantically correct and structurally sound HTML code. A comprehensive evaluation demonstrates significant improvements in both accuracy and efficiency over existing techniques, highlighting the potential of multimodal graph learning to revolutionize design-to-code automation. Code is available at https://github.com/HySonLab/Design2Code

Paper Structure

This paper contains 25 sections, 3 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of the Graph-Enhanced Multimodal Architecture for Generating HTML Code from Visual Sketches. The architecture integrates visual and structural information through a Vision Encoder and a Graph Encoder, both of which condition the language model using GATED XATTN-DENSE blocks, our Cross-Attention mechanism for multimodal conditioning.
  • Figure 2: Original screenshot
  • Figure 3: Text‐masked by PaddleOCR
  • Figure 4: SAM segmentation
  • Figure 6: Gated Cross Attention Block. The Gated Cross Attention Block integrates three input modalities - vision ($X$), language ($Y$), and graph ($Z$) through cross-attention layers, followed by tanh gating to control information flow.
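
The gating described in the Figure 6 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: attention is single-head, the language tokens $Y$ attend first to vision $X$ and then to graph $Z$ (the ordering is an assumption), the dense feed-forward sub-layer is omitted, and the names `alpha_vision` and `alpha_graph` are hypothetical scalar gate parameters.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` attend to `context`."""
    q = queries @ w_q
    k = context @ w_k
    v = context @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v

def gated_cross_attention(y, x, z, params, alpha_vision, alpha_graph):
    """Residual updates of language tokens y, each scaled by a tanh gate."""
    y = y + np.tanh(alpha_vision) * cross_attention(y, x, *params["vision"])
    y = y + np.tanh(alpha_graph) * cross_attention(y, z, *params["graph"])
    return y

rng = np.random.default_rng(0)
d = 8
y = rng.normal(size=(5, d))   # language tokens (Y)
x = rng.normal(size=(4, d))   # vision tokens (X)
z = rng.normal(size=(3, d))   # graph node embeddings (Z)
params = {
    name: tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))
    for name in ("vision", "graph")
}

# Gates at zero: tanh(0) = 0, so the block passes y through unchanged.
closed = gated_cross_attention(y, x, z, params, 0.0, 0.0)
# Non-zero gates let vision and graph information flow into y.
opened = gated_cross_attention(y, x, z, params, 1.0, 1.0)
```

With both gates at zero the block is an identity map on the language tokens, so a gate that starts at zero leaves a pretrained language model's behaviour intact until training opens it.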