Advancing vision-language models in front-end development via data synthesis

Tong Ge, Yashu Liu, Jieping Ye, Tianyi Li, Chao Wang

TL;DR

The paper tackles the gap between design images and executable front-end code by introducing Flame, a vision-language model tailored for React code generation. It presents a self-reflective data-synthesis pipeline with three complementary strategies (Evolution-Based, Waterfall-Model-Based, and Additive Development) that produce large, diverse sets of image-text pairs linking rendered visuals to self-contained functional code and layout descriptions. Using a three-stage training regimen and a benchmark (Flame-React-Eval) that scores pass@k via image-embedding similarity, Flame outperforms state-of-the-art models at React code generation, particularly when image interpretation precedes coding. The work also releases its data and models, and lays a foundation for extending to additional frameworks and multi-turn interactions in future research.

Abstract

Modern front-end (FE) development, especially when leveraging the unique features of frameworks like React and Vue, presents distinctive challenges. These include managing modular architectures, ensuring synchronization between data and visual outputs for declarative rendering, and adapting reusable components to various scenarios. Such complexities make it particularly difficult for state-of-the-art large vision-language models (VLMs) to generate accurate and functional code directly from design images. To address these challenges, we propose a reflective agentic workflow that synthesizes high-quality image-text data to capture the diverse characteristics of FE development. This workflow automates the extraction of self-contained code snippets from real-world projects, renders the corresponding visual outputs, and generates detailed descriptions that link design elements to functional code. (A self-contained code snippet is one that encapsulates all necessary logic, styling, and dependencies, ensuring it functions independently without requiring external imports or context.) To further expand the scope and utility of the synthesis, we introduce three data synthesis strategies: Evolution-based synthesis, which enables scalable and diverse dataset expansion; Waterfall-Model-based synthesis, which generates logically coherent code derived from system requirements; and Additive Development synthesis, which iteratively increases the complexity of human-authored components. We build a large vision-language model, Flame, trained on the synthesized datasets and demonstrate its effectiveness in generating React code via the pass@k metric. Our results suggest that a code VLM trained to interpret images before code generation may achieve better performance.
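To make the evaluation idea concrete, the following is a minimal sketch of how a pass@k score could be computed when "passing" is decided by image-embedding similarity. It is an illustration under stated assumptions, not the authors' implementation: the use of cosine similarity, the threshold value of 0.9, and the function names `cosine_similarity`, `count_passing`, and `pass_at_k` are all hypothetical. The estimator shown is the commonly used unbiased pass@k form, 1 - C(n-c, k) / C(n, k); this page does not confirm which variant the paper uses.

```python
from math import comb


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def count_passing(reference_embedding, candidate_embeddings, threshold=0.9):
    """Count candidates whose rendered-image embedding is close to the reference.

    The threshold is an assumed value chosen for illustration only.
    """
    return sum(
        1
        for emb in candidate_embeddings
        if cosine_similarity(reference_embedding, emb) >= threshold
    )


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Toy usage: 2-dimensional embeddings stand in for real image embeddings.
reference = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
passed = count_passing(reference, candidates)
score = pass_at_k(n=len(candidates), c=passed, k=1)
```

In practice the embeddings would come from an image encoder applied to the rendered output of each generated React snippet and to the reference design image; that step is omitted here because the encoder is not specified on this page.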

Paper Structure

This paper contains 40 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Overview of the data preparation pipeline. The process involves extracting self-contained code snippets from GitHub repositories and applying three code synthesis methods (blue) to enhance dataset diversity. Extracted snippets are then rendered to generate corresponding visual representations (orange), followed by the generation of structured layout descriptions for the components (purple).
  • Figure 2: Comparison of reference image and GPT-4o generated result.
  • Figure 3: Self-contained code snippet containing both the component style (left) and implementation code (right).
  • Figure 4: Screenshot of an example synthesized image-text instance.