VisRefiner: Learning from Visual Differences for Screenshot-to-Code Generation

Jie Deng, Kaichun Yao, Libo Zhang

TL;DR

VisRefiner tackles screenshot-to-code generation by learning from visual differences between rendered outputs and target designs. It introduces a two-stage training regime: difference-aligned supervision, which ties code edits to the perceptual gaps they produce, and self-refinement based on GRPO (Group Relative Policy Optimization), which rewards code updates that bring the rendered output perceptually closer to the target. The framework relies on the VisDiffUI dataset, which pairs visual discrepancies with the corresponding code edits and supports both stable forward generation and guided refinement. Empirical results show improved visual fidelity and layout alignment, along with active refinement behavior, moving multimodal code synthesis toward human-like, perception-guided reasoning.

Abstract

Screenshot-to-code generation aims to translate user interface screenshots into executable frontend code that faithfully reproduces the target layout and style. Existing multimodal large language models perform this mapping directly from screenshots but are trained without observing the visual outcomes of their generated code. In contrast, human developers iteratively render their implementation, compare it with the design, and learn how visual differences relate to code changes. Inspired by this process, we propose VisRefiner, a training framework that enables models to learn from visual differences between rendered predictions and reference designs. We construct difference-aligned supervision that associates visual discrepancies with corresponding code edits, allowing the model to understand how appearance variations arise from implementation changes. Building on this, we introduce a reinforcement learning stage for self-refinement, where the model improves its generated code by observing both the rendered output and the target design, identifying their visual differences, and updating the code accordingly. Experiments show that VisRefiner substantially improves single-step generation quality and layout fidelity, while also endowing models with strong self-refinement ability. These results demonstrate the effectiveness of learning from visual differences for advancing screenshot-to-code generation.
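
The self-refinement stage described above can be illustrated with a minimal sketch. It assumes a toy pixel-level similarity as the perceptual measure and the standard group-relative normalization used in GRPO; the reward actually used by VisRefiner is not specified in this summary, and the names `similarity`, `refinement_reward`, and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev

# Grayscale pixels in [0, 1]; a stand-in for a rendered screenshot.
Image = list[list[float]]


def similarity(a: Image, b: Image) -> float:
    """Toy similarity (assumed metric): 1 minus the mean absolute pixel difference."""
    diffs = [abs(pa - pb) for row_a, row_b in zip(a, b) for pa, pb in zip(row_a, row_b)]
    return 1.0 - mean(diffs)


def refinement_reward(initial: Image, refined: Image, target: Image) -> float:
    """Reward the improvement in similarity to the target after one refinement step."""
    return similarity(refined, target) - similarity(initial, target)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize rewards within one sampled group (zero mean, unit scale)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One target, one initial rendering, and a group of three sampled refinements.
target = [[1.0, 0.0], [0.0, 1.0]]
initial = [[0.5, 0.5], [0.5, 0.5]]
candidates = [
    [[1.0, 0.0], [0.0, 1.0]],  # matches the target
    [[0.5, 0.5], [0.5, 0.5]],  # unchanged from the initial rendering
    [[0.0, 1.0], [1.0, 0.0]],  # further from the target
]

rewards = [refinement_reward(initial, c, target) for c in candidates]
advantages = group_relative_advantages(rewards)
```

In this sketch, a refinement that moves the rendering closer to the target receives a positive advantage, an unchanged one sits at the group mean, and a degraded one is penalized. A real pipeline would render the generated code in a browser and would likely use a stronger perceptual metric.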

Paper Structure (50 sections, 15 equations, 12 figures, 5 tables, 1 algorithm)

Figures (12)

  • Figure 1: Comparison of training paradigms for screenshot-to-code generation. (a) Original approaches train models through one-way supervised mapping from target designs to code. (b) VisRefiner introduces a difference-driven training paradigm, where the model learns from visual discrepancies between its rendered output and the target design.
  • Figure 2: Overview of the proposed VisRefiner framework. The process begins with constructing a visual difference-aligned corpus that provides paired examples of visual deviations and their corresponding code edits. Stage 1 learns to interpret these visual differences through difference-aligned supervision, grounding visual understanding in code updates. Stage 2 applies GRPO-based optimization with self-refinement, where the model refines its own predictions based on perceptual rewards derived from rendered similarity improvements. Together, these stages enable multimodal LLMs to learn directly from visual differences during training.
  • Figure 3: Representative categories of difference-aligned perturbations used in constructing training pairs. Each group shows a reference UI on the left and its perturbed counterpart on the right. The perturbations cover six visual dimensions including color, layout, alignment, component, image, and text. They introduce localized inconsistencies such as color drift, misalignment, component removal, and text resizing, which establish fine-grained supervision linking code edits to perceptual differences.
  • Figure 4: Human evaluation of one-step self-refinement. Annotators compared each model’s refined output against its own single-step generation, labeling each comparison as Win (improved), Tie (unchanged), or Lose (degraded).
  • Figure 5: Representative examples from the seed version of VisDiffUI, covering diverse layouts and component structures.
  • ...and 7 more figures
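
The difference-aligned perturbations in Figure 3 can be sketched for the color dimension as follows. This is only an illustration under stated assumptions: how VisDiffUI actually generates perturbations is not described in this summary, and the helper `perturb_color`, the sample HTML, and the record fields are all hypothetical.

```python
import difflib
import re


def perturb_color(html: str, new_color: str) -> str:
    """Replace the first CSS `color:` value (hypothetical 'color drift' perturbation).

    The lookbehind keeps properties such as `background-color:` untouched.
    """
    return re.sub(r"(?<![-\w])color:\s*[^;]+;", f"color: {new_color};", html, count=1)


# Reference UI code (made-up example) and its perturbed counterpart.
reference_html = '<h1 style="color: #222222; font-size: 24px;">Pricing</h1>\n'
perturbed_html = perturb_color(reference_html, "#888888")

# The code edit that restores the reference, expressed as a unified diff.
edit = list(
    difflib.unified_diff(
        perturbed_html.splitlines(),
        reference_html.splitlines(),
        fromfile="perturbed.html",
        tofile="reference.html",
        lineterm="",
    )
)

# One training record: the perturbed code plus the edit that removes the
# visual difference. Rendering both versions to screenshots is omitted here.
pair = {
    "dimension": "color",
    "perturbed_code": perturbed_html,
    "code_edit": "\n".join(edit),
}
```

Because the perturbation is applied to the code, the corrective edit is known exactly, which is what lets each visual difference be paired with the change that removes it.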