Enhancing Vision-Language Pre-training with Rich Supervisions

Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto

TL;DR

This work introduces S4, a strongly supervised pre-training framework for Vision-Language Models that exploits rich cues from automatically rendered web screenshots. By leveraging the hierarchical HTML DOM structure and spatial localization, S4 defines ten diverse tasks (e.g., Screen Parsing, OCR, Image/Element Grounding, Table Detection/Parsing, Layout Analysis) and trains on a large-scale dataset of 15M screenshots (S4 Data). The approach, built on a ViT encoder and Transformer decoder with coordinate tokens, demonstrates substantial improvements across nine downstream benchmarks, including up to 76.1% gains in Table Detection and notable gains in UI and web understanding tasks; ablations reveal the most impactful tasks and the importance of data scale. Overall, S4 shows that rich, automatically generated supervision from web rendering can significantly boost vision-language pre-training effectiveness, offering a scalable path toward more capable VL models in real-world UI, chart, and web understanding tasks.

Abstract

We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our pre-training method significantly enhances the performance of an image-to-text model on nine varied and popular downstream tasks - up to 76.1% improvement on Table Detection, and at least 1% on Widget Captioning.

Paper Structure (34 sections, 13 figures, 3 tables)

Figures (13)

  • Figure 1: We propose a novel pre-training paradigm - S4, composed of ten carefully designed tasks on large-scale web screenshots. Compared to image-to-text pre-training objectives on screenshots, which mainly utilize HTML [lee2023pix2struct] or a subset of it such as raw text [kim2022ocrfree, li2023spotlight], our paradigm utilizes rich and diverse supervision generated from web rendering that is also cheap to obtain.
  • Figure 2: Compared to traditional pre-training paradigms, our rich supervised pre-training leverages much more information that is also cheap to acquire (i.e., via a browser). We can then utilize the rich semantic and structural annotations to construct novel pre-training tasks that are naturally and directly aligned with downstream tasks. We use green words to refer to the words contained (visible) in the screenshot. We use red words to refer to the words that are not visible in the screenshot. For instance, “price” is not shown on the screenshot, but is the id of an element (refer to picture). We use brown words in the format of <x><y><x><y> to denote the bounding box.
  • Figure 3: Visualization of the layout parsed from a screenshot. Corresponding HTML tags like <h1> are visualized at the top-left corner of each bounding box.
  • Figure 4: Attribute Prediction
  • Figure 5: Element Grounding
  • ...and 8 more figures
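
The <x><y><x><y> bounding-box notation in the Figure 2 caption can be illustrated with a short sketch. The snippet below is only an illustration under stated assumptions: the number of coordinate bins (`NUM_BINS = 1000`), the rounding rule, and the helper names `quantize` and `bbox_to_tokens` are hypothetical and are not taken from the paper. It maps a pixel-space box to four discrete tokens by normalizing each coordinate by the screenshot size.

```python
from typing import Tuple

# Assumption: the source does not state the number of coordinate bins.
NUM_BINS = 1000


def quantize(value: float, size: float, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    # Clamp the normalized coordinate so out-of-range inputs stay valid.
    ratio = min(max(value / size, 0.0), 1.0)
    # A ratio of exactly 1.0 would give num_bins, so cap at the last bin.
    return min(int(ratio * num_bins), num_bins - 1)


def bbox_to_tokens(
    bbox: Tuple[float, float, float, float],
    width: float,
    height: float,
    num_bins: int = NUM_BINS,
) -> str:
    """Serialize (x0, y0, x1, y1) in pixels as '<x><y><x><y>' coordinate tokens."""
    x0, y0, x1, y1 = bbox
    bins = [
        quantize(x0, width, num_bins),
        quantize(y0, height, num_bins),
        quantize(x1, width, num_bins),
        quantize(y1, height, num_bins),
    ]
    return "".join(f"<{b}>" for b in bins)


if __name__ == "__main__":
    # Lower-right quadrant of a hypothetical 1280x720 screenshot.
    print(bbox_to_tokens((640, 360, 1280, 720), width=1280, height=720))
```

Normalizing by the screenshot size keeps the token vocabulary independent of resolution, which is one common reason to discretize coordinates this way; the tokenization actually used by S4 may differ.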