
Vision2Web: A Hierarchical Benchmark for Visual Website Development with Agent Verification

Zehai He, Wenyi Hong, Zhen Yang, Ziyang Pan, Mingdao Liu, Xiaotao Gu, Jie Tang

Abstract

Recent advances in large language models have improved the capabilities of coding agents, yet systematic evaluation of complex, end-to-end website development remains limited. To address this gap, we introduce Vision2Web, a hierarchical benchmark for visual website development that spans static UI-to-code generation, interactive multi-page frontend reproduction, and long-horizon full-stack website development. The benchmark is constructed from real-world websites and comprises 193 tasks across 16 categories, with 918 prototype images and 1,255 test cases. To support flexible, thorough, and reliable evaluation, we propose a workflow-based agent verification paradigm built on two complementary components: a GUI agent verifier and a VLM-based judge. We evaluate multiple vision-language models instantiated under different coding-agent frameworks, revealing substantial performance gaps at all task levels, with state-of-the-art models still struggling on full-stack development.

Paper Structure

This paper contains 32 sections, 12 figures, 7 tables, and 1 algorithm.

Figures (12)

  • Figure 1: Overview of Vision2Web, a hierarchical benchmark for visual website development. Tasks span three levels—static webpages, interactive frontends, and full-stack websites—requiring agents to integrate visual prototypes with textual specifications. Evaluation is performed via a workflow-based agent verification paradigm, measuring functional correctness and visual fidelity.
  • Figure 2: Task distribution of Vision2Web across four major categories and 16 subcategories.
  • Figure 3: Distribution of test cases across website-level tasks in Vision2Web.
  • Figure 4: Distribution of Visual Scores (VS) across prototype heights for representative models under the OpenHands framework.
  • Figure 5: Distribution of prototype image sizes across different device types.
  • ...and 7 more figures