1D-Bench: A Benchmark for Iterative UI Code Generation with Visual Feedback in Real-World

Qiao Xu, Yipeng Yu, Chengxiao Feng, Xu Liu

TL;DR

1D-Bench is a design-to-code benchmark grounded in real e-commerce workflows; "1D" is short for "one day", reflecting the goal of completing design-to-code tasks in less than one day. Experiments on commercial and open-weight multimodal models show that iterative editing generally improves final performance by increasing rendering success and often improving visual similarity.

Abstract

Design-to-code translates high-fidelity UI designs into executable front-end implementations, but progress remains hard to compare due to inconsistent datasets, toolchains, and evaluation protocols. We introduce 1D-Bench, a benchmark grounded in real e-commerce workflows, where each instance provides a reference rendering and an exported intermediate representation that may contain extraction errors. 1D is short for "one day", representing the efficient completion of design-to-code tasks in less than one day. Models take both as input, using the intermediate representation as structural cues while being evaluated against the reference rendering, which tests robustness to intermediate representation defects rather than literal adherence. 1D-Bench requires generating an executable React codebase under a fixed toolchain with an explicit component hierarchy, and defines a multi-round setting in which models iteratively apply component-level edits using execution feedback. Experiments on commercial and open-weight multimodal models show that iterative editing generally improves final performance by increasing rendering success and often improving visual similarity. We further conduct a pilot study on post-training with synthetic repair trajectories and reinforcement-learning-based editing, and observe limited and unstable gains that may stem from sparse terminal rewards and high-variance file-level updates.
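The multi-round setting described in the abstract can be pictured as a small loop: generate a codebase, evaluate it, then apply component-level edits driven by the feedback. The following is a minimal sketch under stated assumptions, not the benchmark's actual harness. The names `Feedback`, `multi_round`, `generate`, `edit`, and `evaluate` are hypothetical, the codebase is modeled as a plain mapping from file path to source text, and stopping when the model proposes no edit is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical model of a codebase: component file path -> source text.
Codebase = Dict[str, str]


@dataclass
class Feedback:
    """Hypothetical execution feedback for one round."""
    rendered: bool      # whether the codebase built and rendered
    similarity: float   # visual similarity to the reference rendering
    log: str            # textual feedback passed back to the model


def multi_round(
    generate: Callable[[], Codebase],
    edit: Callable[[Codebase, Feedback], Dict[str, str]],
    evaluate: Callable[[Codebase], Feedback],
    max_rounds: int = 3,
) -> Tuple[Codebase, List[Feedback]]:
    """Run an initial generation followed by up to max_rounds edit rounds."""
    code = generate()
    history = [evaluate(code)]          # round 0: initial generation
    for _ in range(max_rounds):
        # The model returns replacements only for the components it edits.
        patch = edit(code, history[-1])
        if not patch:                   # assumption: an empty patch ends the run
            break
        code = {**code, **patch}        # untouched components are kept as-is
        history.append(evaluate(code))
    return code, history
```

Comparing `history[0]` with `history[-1]` gives the initial-versus-final view (render success and similarity) that the abstract reports on.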

Paper Structure

This paper contains 44 sections, 10 equations, 10 figures, and 2 tables.

Figures (10)

  • Figure 1: (A) Dataset construction pipeline. (B) Dataset distribution. (C) One dataset example showing the reference UI image and its exported intermediate representation (IR). The IR may contain extraction defects, which propagate to the IR-mapped HTML rendering.
  • Figure 2: Task definition for single-round and multi-round generation.
  • Figure 3: (A) Synthetic preference data relating the score difference between R1 and R2 to human preference, shown as jittered scatter and binned means. (B) Multi-round final score trends with mean curves and $\pm$1 standard deviation bands. (C) Render success rates for initial and final rounds shown as paired bar charts. (D) Boxplots of metric breakdowns for initial and final rounds.
  • Figure 4: Overview of the pilot post-training study. (A) Synthetic trajectory construction for SFT and the segmented-rollout GRPO setup. (B) SFT training loss. (C) Mean similarity score during GRPO training. (D) KL divergence to the reference policy during GRPO training.
  • Figure 5: Prompts for the VLM UI check.
  • ...and 5 more figures