IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video?
Yang Chen, Minghao Liu, Yufan Shen, Yunwen Li, Tianyuan Huang, Xinyu Fang, Tianyu Zheng, Wenxuan Huang, Cheng Yang, Daocheng Fu, Jianbiao Mei, Rong Wu, Yunfei Zhao, Licheng Wen, Xuemeng Yang, Song Mao, Qunshu Lin, Zhi Yu, Yongliang Shen, Yu Qiao, Botian Shi
TL;DR
IWR-Bench introduces the first benchmark for Interactive Webpage Reconstruction from video, addressing the gap between static screenshot-to-code tasks and real-world, stateful web applications. Each task pairs a user-interaction video with the crawled static assets of the page, and the generated HTML/CSS/JS is evaluated by an automated agent-as-a-judge framework. The paper defines two core metrics, the Interactive Functionality Score (IFS) and the Visual Fidelity Score (VFS), and combines them into a Final Score that emphasizes functional correctness. Across 28 LVLMs, functional synthesis lags far behind visual fidelity: the best model reaches a Final Score of only 36.35%, with 24.39% IFS against 64.25% VFS, which leaves substantial room for improvement in temporal reasoning and event-driven logic. The benchmark and evaluation code are to be released publicly to guide future research toward temporally coherent, interactive web generation from multimodal inputs.
Abstract
The webpage-to-code task requires models to understand visual representations of webpages and generate the corresponding code. However, existing benchmarks focus primarily on static screenshot-to-code tasks and overlook the dynamic interactions fundamental to real-world web applications. To address this limitation, this paper introduces IWR-Bench, a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, covering 1,001 actions and a diverse range of interaction complexities (e.g., web games), visual styles, and domains. In line with standard web development practice, each task includes not only a user interaction video but also all crawled static assets (e.g., images, videos). The benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from the video and assets, and advanced code generation to translate this logic into functional code. An agent-as-a-judge framework with a comprehensive metric system automatically assesses the functional correctness and visual fidelity of the generated webpages. Extensive experiments on 28 LVLMs reveal a significant challenge: the best model achieves an overall score of only 36.35%, as functional correctness (24.39% IFS) lags significantly behind visual fidelity (64.25% VFS). These results highlight critical limitations in current models' ability to reason about temporal dynamics and synthesize event-driven logic, establishing IWR-Bench as a challenging frontier for vision-language research. The benchmark and evaluation code will be made publicly available at https://github.com/SIGMME/IWR-Bench.
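The abstract reports an overall score alongside IFS and VFS but does not state how the two are combined. As a rough sketch only, assuming a simple weighted average with a hypothetical weight of 0.7 on IFS (this weight and the names `final_score` and `w_ifs` are assumptions, not taken from the text; the weight was chosen because it is consistent with the reported figures), the arithmetic can be checked as follows:

```python
def final_score(ifs: float, vfs: float, w_ifs: float = 0.7) -> float:
    """Weighted average of IFS and VFS, both given as percentages.

    w_ifs is an ASSUMED weight on functional correctness; the abstract
    only says the final score emphasizes functionality.
    """
    if not 0.0 <= w_ifs <= 1.0:
        raise ValueError("w_ifs must lie in [0, 1]")
    return w_ifs * ifs + (1.0 - w_ifs) * vfs


# Figures reported for the best model in the abstract.
best = final_score(ifs=24.39, vfs=64.25)
print(round(best, 2))  # 36.35, matching the reported overall score
```

Under this assumed weighting, a model would need large IFS gains to move its overall score much, since VFS contributes less than a third of the total.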
