Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao, Kunyu Shi, Pengkai Zhu, Edouard Belval, Oren Nuriel, Srikar Appalaraju, Shabnam Ghadar, Vijay Mahadevan, Zhuowen Tu, Stefano Soatto
TL;DR
This work introduces S4, a strongly supervised pre-training framework for Vision-Language Models that exploits rich cues from automatically rendered web screenshots. By leveraging the hierarchical HTML DOM structure and spatial localization, S4 defines ten diverse tasks (e.g., Screen Parsing, OCR, Image/Element Grounding, Table Detection/Parsing, Layout Analysis) and trains on a large-scale dataset of 15M screenshots (S4 Data). The approach, built on a ViT encoder and a Transformer decoder with coordinate tokens, improves results on nine downstream benchmarks, with gains of up to 76.1% on Table Detection and notable gains on UI and web understanding tasks; ablations identify the most impactful tasks and show the importance of data scale. Overall, S4 shows that rich, automatically generated supervision from web rendering can significantly boost vision-language pre-training, offering a scalable path toward more capable Vision-Language Models for real-world UI, chart, and web understanding tasks.
Abstract
We propose Strongly Supervised pre-training with ScreenShots (S4), a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Web screenshots unlock a treasure trove of visual and textual cues that are not present in image-text pairs. In S4, we leverage the inherent tree-structured hierarchy of HTML elements and their spatial localization to carefully design 10 pre-training tasks with large-scale annotated data. These tasks resemble downstream tasks across different domains, and the annotations are cheap to obtain. We demonstrate that, compared to current screenshot pre-training objectives, our innovative pre-training method significantly enhances the performance of image-to-text models on nine varied and popular downstream tasks, with improvements of up to 76.1% on Table Detection and at least 1% on Widget Captioning.
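To make the idea of spatially localized supervision concrete, the following is a minimal sketch of how an HTML element's bounding box might be turned into discrete coordinate tokens for a text decoder. It is an illustration under stated assumptions, not the authors' implementation: the bin count of 1000, the `<n>` token format, and the names `box_to_coord_tokens` and `grounding_target` are all hypothetical.

```python
from typing import List, Tuple

# Assumed number of quantization bins per axis (hypothetical, not from the paper).
NUM_BINS = 1000

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def box_to_coord_tokens(box: Box, width: float, height: float,
                        num_bins: int = NUM_BINS) -> List[str]:
    """Quantize a pixel-space box into four discrete coordinate tokens."""
    def quantize(value: float, size: float) -> int:
        # Normalize to [0, 1], clamp, then map to an integer bin index.
        frac = min(max(value / size, 0.0), 1.0)
        return min(int(frac * num_bins), num_bins - 1)

    x0, y0, x1, y1 = box
    bins = [quantize(x0, width), quantize(y0, height),
            quantize(x1, width), quantize(y1, height)]
    # The "<n>" spelling is an illustrative choice of token format.
    return [f"<{b}>" for b in bins]


def grounding_target(text: str, box: Box, width: float, height: float) -> str:
    """Serialize an element's text followed by its quantized box as one target string."""
    return " ".join([text] + box_to_coord_tokens(box, width, height))
```

For example, on a 1280x720 screenshot, an element with text "Sign in" whose box spans from the top-left corner to the image center would be serialized as `Sign in <0> <0> <500> <500>` under these assumptions. Quantizing coordinates this way lets localization tasks share the same sequence-generation interface as text-only tasks.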
