PreSTU: Pre-Training for Scene-Text Understanding

Jihyung Kil, Soravit Changpinyo, Xi Chen, Hexiang Hu, Sebastian Goodman, Wei-Lun Chao, Radu Soricut

TL;DR

PreSTU introduces OCR-aware pre-training for scene-text understanding by combining a task-agnostic splitocr objective with task-specific vqa and cap objectives, using an end-to-end ViT+mT5 V&L architecture. Trained on CC15M and ST-VQA signals, it learns to recognize scene text from image pixels and connect it to visual context, yielding substantial gains across twelve STU benchmarks, including TextVQA, TextCaps, and OCR-VQA. The approach also demonstrates strong generalization to diverse scene-text domains and robustness to downstream OCR systems. Overall, PreSTU offers a scalable, OCR-centered pre-training paradigm that enhances both VQA and captioning tasks by grounding text in rich visual contexts.

Abstract

The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets with scene text obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks.
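
To make the idea of an OCR-aware objective concrete, here is a minimal sketch of how a training pair for the splitocr objective might be built from OCR output. It assumes, going by the objective's name, that the detected tokens are split into a prefix given to the model as text input and a suffix the model must generate; the function name `make_splitocr_example`, the uniform split point, and the assumption that tokens arrive in reading order are illustrative and are not stated in this summary.

```python
import random
from typing import List, Optional, Tuple


def make_splitocr_example(
    ocr_tokens: List[str],
    rng: Optional[random.Random] = None,
) -> Tuple[str, str]:
    """Split OCR tokens into (text input, text target).

    Assumptions (not from the source): tokens are already in reading
    order, and the split point is drawn uniformly at random.
    """
    if not ocr_tokens:
        # No scene text detected: nothing to condition on or to predict.
        return "", ""
    rng = rng or random.Random()
    # Split index in [0, len - 1], so the target always keeps at least one
    # token; index 0 means every token must be read from the image alone.
    split = rng.randrange(len(ocr_tokens))
    prefix = " ".join(ocr_tokens[:split])
    suffix = " ".join(ocr_tokens[split:])
    return prefix, suffix


tokens = ["united", "states", "space", "shuttle"]
text_input, text_target = make_splitocr_example(tokens, random.Random(0))
```

In a full pipeline the image pixels would be fed to the visual encoder alongside `text_input`, and the decoder would be trained to emit `text_target`; that part is omitted here.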

Paper Structure (23 sections, 6 figures, 14 tables)


Figures (6)

  • Figure 1: Example of scene-text understanding (STU) tasks. NoPreSTU (baseline) and PreSTU share the same V&L model, but PreSTU is pre-trained with our proposed objectives. Scene text is highlighted by bounding boxes. Unlike the baseline, PreSTU correctly predicts the title of the book on scene-text VQA (TextVQA) and even generates a more detailed scene-text caption (e.g., "united states space shuttle") than the human-annotated ground truth (TextCaps).
  • Figure 2: Our proposed pipeline. Left: Comparison between PreSTU and the NoPreSTU baseline. Green denotes the PreSTU pre-training phase and yellow the downstream fine-tuning phase. splitocr encourages scene-text recognition as well as learning the connection between scene text and its visual context; vqa and cap further strengthen that connection. Right: The text input and output for each objective. All objectives utilize OCR signals. See Figure 3 for the architecture of PreSTU.
  • Figure 3: V&L model architecture used in all of our experiments. We use a simple transformer-based encoder-decoder (pre-trained ViT + mT5) that transforms image and text inputs into the text output. Green box: text input/output. Blue box: visual input. Yellow box: model blocks. See Figure 2 for the input-output pairs for different objectives.
  • Figure 4: PreSTU's OCR token prediction. The quality of OCR tokens generated by splitocr is comparable to that of the gOCR system. This suggests that splitocr could serve as an alternative OCR system when other systems are not available.
  • Figure 5: gOCR tokens vs. PreSTU prediction on TextVQA. The gOCR system misses some OCR tokens in the image (e.g., "13") or detects them incorrectly (e.g., "lexue"). This leads NoPreSTU to predict wrong answers (e.g., "5" or "cooper"). In contrast, splitocr with gOCR tokens as input predicts the answers correctly, with correct OCR tokens (e.g., "13" or "lexus").
  • ...and 1 more figure