Image2Struct: Benchmarking Structure Extraction for Vision-Language Models

Josselin Somerville Roberts, Tony Lee, Chi Heem Wong, Michihiro Yasunaga, Yifan Mai, Percy Liang

TL;DR

Image2Struct is a benchmark that evaluates vision-language models (VLMs) on extracting structure from images. Scores vary widely across the evaluated models, indicating that the benchmark can differentiate between the performance of different VLMs.

Abstract

We introduce Image2Struct, a benchmark to evaluate vision-language models (VLMs) on extracting structure from images. Our benchmark 1) captures real-world use cases, 2) is fully automatic and does not require human judgment, and 3) is based on a renewable stream of fresh data. In Image2Struct, VLMs are prompted to generate the underlying structure (e.g., LaTeX code or HTML) from an input image (e.g., webpage screenshot). The structure is then rendered to produce an output image (e.g., rendered webpage), which is compared against the input image to produce a similarity score. This round-trip evaluation allows us to quantitatively evaluate VLMs on tasks with multiple valid structures. We create a pipeline that downloads fresh data from active online communities upon execution and evaluates the VLMs without human intervention. We introduce three domains (Webpages, LaTeX, and Musical Scores) and use five image metrics (pixel similarity, cosine similarity between the Inception vectors, learned perceptual image patch similarity, structural similarity index measure, and earth mover similarity) that allow efficient and automatic comparison between pairs of images. We evaluate Image2Struct on 14 prominent VLMs and find that scores vary widely, indicating that Image2Struct can differentiate between the performances of different VLMs. Additionally, the best score varies considerably across domains (e.g., 0.402 on sheet music vs. 0.830 on LaTeX equations), indicating that Image2Struct contains tasks of varying difficulty. For transparency, we release the full results at https://crfm.stanford.edu/helm/image2struct/v1.0.1/.
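
The round-trip evaluation described in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the names `round_trip_score`, `pixel_similarity`, `toy_render`, and `RenderError` are hypothetical, the "structure" is a toy text format rather than LaTeX or HTML, and the metric is one simple reading of pixel similarity (the fraction of exactly matching pixels). Scoring unrenderable output as 0 is also an assumption of this sketch.

```python
from typing import Callable, List

# Assumption for this sketch: a grayscale image is a list of rows of 0-255 ints.
Image = List[List[int]]


class RenderError(Exception):
    """Hypothetical error raised when a predicted structure cannot be rendered."""


def pixel_similarity(a: Image, b: Image) -> float:
    """Fraction of pixel positions with identical values (simplified metric)."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have the same shape")
    total = sum(len(row) for row in a)
    if total == 0:
        raise ValueError("images must be non-empty")
    matches = sum(
        1 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb) if pa == pb
    )
    return matches / total


def round_trip_score(
    input_image: Image,
    predict_structure: Callable[[Image], str],
    render: Callable[[str], Image],
    metric: Callable[[Image, Image], float] = pixel_similarity,
) -> float:
    """Image -> predicted structure -> rendered image -> similarity to input."""
    structure = predict_structure(input_image)
    try:
        rendered = render(structure)
    except RenderError:
        return 0.0  # assumption: output that fails to render scores zero
    return metric(input_image, rendered)


def toy_render(structure: str) -> Image:
    """Toy renderer: each line of '0'/'1' characters becomes a row of pixels."""
    text = structure.strip()
    rows = text.split("\n")
    if not text or any(set(row) - {"0", "1"} for row in rows):
        raise RenderError("structure is not valid toy markup")
    return [[255 if ch == "1" else 0 for ch in row] for row in rows]


target: Image = [[0, 255], [255, 0]]

# Two different structures that render to the same image both score 1.0.
exact = round_trip_score(target, lambda img: "01\n10", toy_render)
padded = round_trip_score(target, lambda img: "  01\n10\n", toy_render)
# One wrong pixel out of four.
partial = round_trip_score(target, lambda img: "01\n11", toy_render)
# Output that cannot be rendered at all.
broken = round_trip_score(target, lambda img: "oops", toy_render)

print(exact, padded, partial, broken)
```

Because the score is computed on rendered images rather than on the structure text, the two differently formatted predictions above receive the same score, which is the property that lets this style of evaluation handle tasks with multiple valid structures.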

Paper Structure

This paper contains 52 sections, 8 equations, 23 figures, 8 tables.

Figures (23)

  • Figure 1:
  • Figure 2: In Image2Struct tasks, given an input image, the goal is to produce a structure (e.g., LaTeX code), so that the rendering of the structure produces the original image. We include three domains: Webpages, LaTeX, and Musical scores. We show an example of the input image, model predicted structure, and rendered image for each of the domains in our benchmark.
  • Figure 3: Our pipeline for evaluation, using the example of LaTeX. First, we download data from online sources. Second, we filter and process the images. Third, we prompt the VLMs with these images to produce output structures. Fourth, we render the structures. Finally, we evaluate the outputs by comparing the rendered images against the input images.
  • Figure 4: Example model predictions for LaTeX (equation), LaTeX (plot), and Music tasks. Results in the domain of Webpages, as well as additional instances of LaTeX, can be found in the appendix.
  • Figure A1: An illustration of the two scales at which EMD$_\text{block}$ operates. The left image is an altered copy of the right one in that 4 patches are manipulated. EMD$_\text{block}$ computes an optimal flow where 3 of these patches (in red) are moved completely without modification. For the blue patch, it decides that it incurs a lower cost to move some pixels within the patch (the zoomed version on the right). On top of moving blocks or pixels, EMD$_\text{block}$ can change the pixel colors at a cost (we do not illustrate color modification in this example for simplicity).
  • ...and 18 more figures