ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents
Yilei Jiang, Yaozhi Zheng, Yuxuan Wan, Jiaming Han, Qunzhong Wang, Michael R. Lyu, Xiangyu Yue
TL;DR
ScreenCoder decomposes UI-to-code generation into grounding, planning, and generation with a modular multi-agent design to overcome the perception and planning failures of monolithic MLLMs. It doubles as a scalable data engine, producing Screen-10K and a new ScreenBench benchmark, and applies dual-stage post-training (SFT followed by RL with GRPO) to fine-tune open-source MLLMs for pixel-accurate front-end synthesis. Results show state-of-the-art visual fidelity and structural coherence, supported by qualitative evidence and human-in-the-loop benefits. The approach offers a practical path toward production-ready UI automation and provides resources to advance open-model capabilities in multimodal program synthesis.
Abstract
Automating the transformation of user interface (UI) designs into front-end code holds significant promise for accelerating software development and democratizing design workflows. While multimodal large language models (MLLMs) can translate images to code, they often fail on complex UIs: a single monolithic model struggles to unify visual perception, layout planning, and code synthesis, which leads to frequent perception and planning errors. To address this, we propose ScreenCoder, a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation. By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches. Furthermore, ScreenCoder serves as a scalable data engine, enabling us to generate high-quality image-code pairs. We use this data to fine-tune an open-source MLLM via a dual-stage pipeline of supervised fine-tuning and reinforcement learning, yielding substantial gains in its UI generation capabilities. Extensive experiments demonstrate that our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness. Our code is publicly available at https://github.com/leigest519/ScreenCoder.
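The three-stage decomposition described in the abstract can be illustrated with a minimal sketch. This is not the ScreenCoder implementation: every name here (`Region`, `LayoutNode`, `ground`, `plan`, `generate`, `screenshot_to_html`) is hypothetical, and the grounding step is a hard-coded stub standing in for what would be a vision-language model call. The sketch only shows how the output of each stage could feed the next.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# All names below are illustrative assumptions, not the ScreenCoder API.
Box = Tuple[int, int, int, int]  # (x, y, width, height) in pixels


@dataclass
class Region:
    """A labeled UI region, as a grounding stage might emit."""
    label: str
    box: Box


@dataclass
class LayoutNode:
    """A node in the hierarchical layout produced by the planning stage."""
    label: str
    box: Box
    children: List["LayoutNode"] = field(default_factory=list)


def ground(size: Tuple[int, int]) -> List[Region]:
    """Grounding (stub): return fixed regions instead of querying a model."""
    w, h = size
    return [
        Region("main", (200, 60, w - 200, h - 60)),
        Region("header", (0, 0, w, 60)),
        Region("sidebar", (0, 60, 200, h - 60)),
    ]


def plan(regions: List[Region], size: Tuple[int, int]) -> LayoutNode:
    """Planning: order regions top-to-bottom, then left-to-right, under a root."""
    w, h = size
    ordered = sorted(regions, key=lambda r: (r.box[1], r.box[0]))
    root = LayoutNode("root", (0, 0, w, h))
    root.children = [LayoutNode(r.label, r.box) for r in ordered]
    return root


def generate(node: LayoutNode, indent: int = 0) -> str:
    """Generation: emit nested HTML from the layout tree."""
    pad = "  " * indent
    _, _, w, h = node.box
    open_tag = f'{pad}<div class="{node.label}" style="width:{w}px;height:{h}px">'
    if not node.children:
        return open_tag + "</div>"
    inner = "\n".join(generate(child, indent + 1) for child in node.children)
    return f"{open_tag}\n{inner}\n{pad}</div>"


def screenshot_to_html(size: Tuple[int, int]) -> str:
    """Chain the three stages: grounding -> planning -> generation."""
    return generate(plan(ground(size), size))


html = screenshot_to_html((1280, 720))
print(html)
```

Keeping each stage behind its own function boundary means an intermediate result, such as the list of grounded regions, can be inspected or corrected before code is generated, which is the kind of interpretability the modular design aims for.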
