PreSTU: Pre-Training for Scene-Text Understanding
Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut
TL;DR
PreSTU introduces OCR-aware pre-training for scene-text understanding (STU), combining a task-agnostic splitocr objective with task-specific vqa and cap objectives in an end-to-end ViT+mT5 vision-and-language (V&L) architecture. Trained on CC15M and ST-VQA signals, it learns to recognize scene text from image pixels and connect it to visual context, yielding substantial gains across twelve STU benchmarks, including TextVQA, TextCaps, and OCR-VQA. The approach also demonstrates strong generalization to diverse scene-text domains and robustness to downstream OCR systems. Overall, PreSTU offers a scalable, OCR-centered pre-training paradigm that improves both VQA and captioning by grounding text in its visual context.
Abstract
The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods typically do not include this ability in their training objectives. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets with scene text obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
