IW-Bench: Evaluating Large Multimodal Models for Converting Image-to-Web

Hongcheng Guo, Wei Zhang, Junhao Chen, Yaonan Gu, Jian Yang, Junjia Du, Shaosheng Cao, Binyuan Hui, Tianyu Liu, Jianxin Ma, Chang Zhou, Zhoujun Li

TL;DR

IW-Bench introduces two novel metrics, Element Accuracy and Layout Accuracy, to rigorously evaluate image-to-web code generation by parsing DOM structures and linearizing layout. It couples this with a five-hop multimodal chain-of-thought framework to guide model reasoning from element inference to code generation, achieving improved performance across a diverse set of large multimodal models. The benchmark uses a 1200-entry dataset with three difficulty levels, combining real and GPT-4-generated web pages, and demonstrates that WebSight and GPT-4V with CoT offer strong performance while highlighting gaps in layout understanding for more complex scenes. The work provides a concrete, extensible evaluation protocol with human-in-the-loop validation and ablations, laying groundwork for more robust Image-to-Web benchmarks and model improvements.

Abstract

Recent advancements in large multimodal models have led to significant strides in image comprehension. Despite these advancements, there is no robust benchmark specifically for assessing the Image-to-Web conversion proficiency of these models. First, it is essential to ensure the integrity of the generated web elements, which fall into visible and invisible categories. Previous evaluation methods (e.g., BLEU) are highly sensitive to invisible elements in web code, whose presence can alter scores significantly. Furthermore, it is crucial to measure the layout information of web pages, i.e., the positional relationships between elements, which previous work overlooks. To address these challenges, we have curated and aligned a benchmark of images and corresponding web code (IW-BENCH). Specifically, we propose Element Accuracy, which tests the completeness of the elements by parsing the Document Object Model (DOM) tree. We also propose Layout Accuracy, which analyzes the positional relationships of elements by converting the DOM tree into a common subsequence. In addition, we design a five-hop multimodal Chain-of-Thought prompting method for better performance, which consists of five hops: 1) SoM prompt injection. 2) Inferring Elements. 3) Inferring Layout. 4) Inferring Web code. 5) Reflection. Our benchmark comprises 1200 pairs of images and web code with varying levels of difficulty. We have conducted extensive experiments on existing large multimodal models, offering insights into their performance and areas for improvement in the image-to-web domain.
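To make the two metrics more concrete, here is a minimal sketch of how DOM-based scores of this kind could be computed. It is not the paper's implementation: the abstract does not give the exact formulas, so the sketch assumes that elements are compared by tag name only, that Element Accuracy is the fraction of reference elements recovered (multiset overlap), and that Layout Accuracy is the longest common subsequence of the two linearized tag sequences divided by the reference length. The names `linearize`, `element_accuracy`, `lcs_length`, and `layout_accuracy` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record element tag names in document order (a simple DOM linearization)."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <br/> also reach this method.
        self.tags.append(tag)


def linearize(html):
    """Return the list of tag names in the order they open in the markup."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return collector.tags


def element_accuracy(reference_html, generated_html):
    """Assumed definition: share of reference elements present in the output."""
    ref = Counter(linearize(reference_html))
    gen = Counter(linearize(generated_html))
    total = sum(ref.values())
    if total == 0:
        return 1.0
    # Counter intersection keeps the minimum count per tag name.
    return sum((ref & gen).values()) / total


def lcs_length(a, b):
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def layout_accuracy(reference_html, generated_html):
    """Assumed definition: LCS of the tag sequences over the reference length."""
    ref = linearize(reference_html)
    gen = linearize(generated_html)
    if not ref:
        return 1.0
    return lcs_length(ref, gen) / len(ref)


reference = "<html><body><h1>T</h1><p>a</p><img src='x.png'></body></html>"
generated = "<html><body><p>a</p><h1>T</h1></body></html>"
print(element_accuracy(reference, generated))  # 4 of 5 reference tags found
print(layout_accuracy(reference, generated))   # order of h1 and p is swapped
```

In this toy pair the generated page drops the image and swaps the heading and paragraph, so the order-sensitive score falls below the element score. That gap is the kind of layout error an order-insensitive element count cannot see.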

Paper Structure (48 sections, 4 equations, 10 figures, 7 tables)

This paper contains 48 sections, 4 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: The process of Image-to-Web. We prompt a large multimodal model to generate web code from the input image, then compare the rendered result against the input image for consistency.
  • Figure 2: Wordcloud. The key words in IW-Bench are related to the web and internet, such as 'html', 'header'.
  • Figure 3: Benchmark Construction. This pipeline illustrates the multi-step process used to construct IW-Bench for web code and images of varying complexity levels.
  • Figure 4: Overview of Five-hop Multimodal Chain-of-Thought Prompting. Our method contains five hops: 1) SoM prompt injection. 2) Inferring Elements. 3) Inferring Layout. 4) Inferring Web code. 5) Reflection.
  • Figure 5: Example of SoM prompt injection. The image on the left is the original web page, and the image on the right is the rendered web page after injection.
  • ...and 5 more figures