WebVR: Benchmarking Multimodal LLMs for WebPage Recreation from Videos via Human-Aligned Visual Rubrics

Yuhong Dai, Yanlin Lai, Mitt Huang, Hangyu Guo, Dingming Li, Hongbo Peng, Haodong Li, Yingxiu Zhao, Haoran Lyu, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Abstract

Existing web-generation benchmarks rely on text prompts or static screenshots as input. However, videos naturally convey richer signals such as interaction flow, transition timing, and motion continuity, which are essential for faithful webpage recreation. Despite this potential, video-conditioned webpage generation remains largely unexplored, with no dedicated benchmark for this task. To fill this gap, we introduce WebVR, a benchmark that evaluates whether MLLMs can faithfully recreate webpages from demonstration videos. WebVR contains 175 webpages across diverse categories, all constructed through a controlled synthesis pipeline rather than web crawling, ensuring varied and realistic demonstrations without overlap with existing online pages. We also design a fine-grained, human-aligned visual rubric that evaluates the generated webpages across multiple dimensions. Experiments on 19 models reveal substantial gaps in recreating fine-grained style and motion quality, while the rubric-based automatic evaluation achieves 96% agreement with human preferences. We release the dataset, evaluation toolkit, and baseline results to support future research on video-to-webpage generation.
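As a rough illustration of what an agreement rate such as the 96% figure can mean, the sketch below computes pairwise agreement between an automatic scorer and human preference labels. This is an assumption for illustration only: the abstract does not specify how agreement is measured, and the function name `pairwise_agreement`, the "A"/"B" label format, and the handling of ties are all hypothetical.

```python
from typing import Sequence


def pairwise_agreement(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    human_prefs: Sequence[str],
) -> float:
    """Fraction of comparison pairs where the automatic scores rank the two
    candidates the same way as the human label ("A" or "B").

    Hypothetical helper: ties in the automatic scores are counted as
    disagreement, which is an assumption, not the paper's protocol.
    """
    if not (len(scores_a) == len(scores_b) == len(human_prefs)):
        raise ValueError("all inputs must have the same length")
    if len(human_prefs) == 0:
        raise ValueError("at least one comparison pair is required")

    agreements = 0
    for score_a, score_b, pref in zip(scores_a, scores_b, human_prefs):
        if score_a > score_b:
            auto_pref = "A"
        elif score_b > score_a:
            auto_pref = "B"
        else:
            auto_pref = None  # tie: never matches a human label
        if auto_pref == pref:
            agreements += 1
    return agreements / len(human_prefs)


# Toy data, not taken from the paper: the scorer matches the human on 3 of 4 pairs.
rate = pairwise_agreement(
    [0.9, 0.4, 0.7, 0.2],
    [0.5, 0.6, 0.3, 0.8],
    ["A", "B", "A", "A"],
)
print(rate)
```

Counting ties as disagreement keeps the denominator fixed at the number of human-labeled pairs; a real protocol might instead drop tied pairs or allow a "tie" label.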

Paper Structure

This paper contains 22 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Overview of the WebVR benchmark. The left panel illustrates the task distribution across various webpage categories, while the right panel details the automated evaluation pipeline, which executes generated code in a standardized sandbox and scores it against human-aligned visual rubrics.
  • Figure 2: A case study illustrating the conversion from an original video into generated code and its rendered video output.
  • Figure 3: The WebVR data synthesis pipeline. The process consists of four stages: (A) Seed Data Preparation via semantic re-theming, (B) Visual Asset Retrieval to ground specifications, (C) Candidate Generation and Execution using multiple MLLMs, and (D) Automated Filtering and Refinement to construct the final high-quality benchmark set.
  • Figure 4: Statistics of the WebVR benchmark dataset, showing the distributions of (a) reference video durations (in seconds) and (b) the number of visual rubric items per instance.
  • Figure 5: Hashed bars represent scores assigned without the visual rubric, while solid bars represent scores with the rubric applied. Note the inflated and compressed scoring distribution when the rubric is absent.
  • ...and 1 more figure