ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents

Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue

TL;DR

ScreenCoder decomposes UI-to-code generation into grounding, planning, and generation, using a modular multi-agent design to overcome the perception and planning failures of monolithic MLLMs. It doubles as a scalable data engine, producing Screen-10K and a new ScreenBench benchmark, and applies dual-stage post-training (SFT, then RL with GRPO) to fine-tune open-source MLLMs for pixel-accurate front-end synthesis. Results show state-of-the-art visual fidelity and structural coherence, along with strong qualitative evidence and human-in-the-loop benefits. The approach offers a practical path for production-ready UI automation and provides resources to advance open-model capabilities in multimodal program synthesis.

Abstract

Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs, struggling to unify visual perception, layout planning, and code synthesis within a single monolithic model, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, demonstrating substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is made publicly available at https://github.com/leigest519/ScreenCoder.
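
The reinforcement learning stage is named as GRPO in the TL;DR. The sketch below illustrates only the group-relative advantage step that GRPO is generally known for: several candidate outputs are sampled for one input, each is scored, and every score is normalized against its own group. The function name, the example reward values, and the use of population standard deviation are assumptions for illustration; the reward function actually used by ScreenCoder is not described in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar score per sampled output for the same input.
    Population standard deviation is an assumption here; implementations
    differ on population vs. sample deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for three generated pages from one screenshot.
advantages = group_relative_advantages([0.2, 0.5, 0.8])
```

Outputs that score above their group's average receive a positive advantage and those below receive a negative one, so no separate value network is needed to provide a baseline.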

Paper Structure

This paper contains 35 sections, 4 equations, 14 figures, 4 tables, 1 algorithm.

Figures (14)

  • Figure 1: ScreenCoder accurately transforms website screenshots and design sketches into pixel-perfect front-end code. The figure showcases a variety of inputs on the left, including high-fidelity screenshots and a low-fidelity design sketch. The right column displays the corresponding webpages rendered from our model's generated code, demonstrating its high-fidelity replication capabilities.
  • Figure 2: Analysis of Common MLLM Failure Modes in UI-to-Code Generation. We identify two primary error categories: (a) Perception Errors, where the model fails to accurately interpret visual details, leading to missing icons or incorrect colors, and (b) Planning Errors, where the model fails to correctly reason about the spatial layout, resulting in elements being placed in the wrong positions.
  • Figure 3: Overview of ScreenCoder. Given UI screenshots or design sketches as input, the Grounding Agent first detects and labels key components (e.g., header, navbar, sidebar, content). The Planning Agent organizes these components into a hierarchical layout using front-end engineering priors. The Generation Agent synthesizes initial HTML code with placeholders, followed by content mapping to produce the final webpage and code.
  • Figure 4: Qualitative comparison of UI-to-code generation. While leading MLLMs fail to accurately replicate the target website's layout, styling, and component structure, our method, ScreenCoder, produces a high-fidelity result that closely matches the original design in both appearance and organization.
  • Figure 5: Qualitative comparison between our proposed method and various baselines.
  • ...and 9 more figures
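
The three-stage flow described in the Figure 3 caption can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: `Region`, `LayoutNode`, `ground`, `plan`, and `generate` are hypothetical names, the grounding step returns fixed boxes instead of calling a vision model, and the planning step uses a simple top-to-bottom, left-to-right ordering as a stand-in for the front-end engineering priors mentioned in the caption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from html import escape


@dataclass
class Region:
    """A labeled bounding box, as a grounding step might emit (hypothetical)."""
    label: str  # e.g. "header", "navbar", "sidebar", "content"
    x: int
    y: int
    w: int
    h: int


@dataclass
class LayoutNode:
    """One node of the hierarchical layout produced by planning (hypothetical)."""
    tag: str
    label: str
    children: list[LayoutNode] = field(default_factory=list)


def ground(screenshot_path: str) -> list[Region]:
    # Stand-in for the Grounding Agent: a real system would query a
    # vision-language model; fixed boxes are returned here for illustration.
    return [
        Region("content", 240, 80, 1040, 640),
        Region("header", 0, 0, 1280, 80),
        Region("sidebar", 0, 80, 240, 640),
    ]


def plan(regions: list[Region]) -> LayoutNode:
    # Stand-in for the Planning Agent: order regions top-to-bottom, then
    # left-to-right, and map each label to a semantic HTML tag.
    tags = {"header": "header", "navbar": "nav", "sidebar": "aside"}
    root = LayoutNode("div", "page")
    for r in sorted(regions, key=lambda r: (r.y, r.x)):
        root.children.append(LayoutNode(tags.get(r.label, "main"), r.label))
    return root


def generate(node: LayoutNode, depth: int = 0) -> str:
    # Stand-in for the Generation Agent: emit HTML with placeholder comments
    # that a later content-mapping step would fill in.
    pad = "  " * depth
    label = escape(node.label)
    if not node.children:
        return (f'{pad}<{node.tag} data-region="{label}">'
                f"<!-- placeholder: {label} --></{node.tag}>")
    inner = "\n".join(generate(c, depth + 1) for c in node.children)
    return f'{pad}<{node.tag} data-region="{label}">\n{inner}\n{pad}</{node.tag}>'


html_out = generate(plan(ground("screenshot.png")))
```

Keeping each stage behind its own function mirrors the modular design: the intermediate `Region` list and `LayoutNode` tree can be inspected or corrected by a person before code is generated.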