TAP-VL: Text Layout-Aware Pre-training for Enriched Vision-Language Models

Jonathan Fhima, Elad Ben Avraham, Oren Nuriel, Yair Kittenplon, Roy Ganz, Aviad Aberdam, Ron Litman

TL;DR

TAP-VL is a novel method that treats OCR information as a distinct modality and integrates it into any VL model. Applying TAP-VL to top-performing VL models yields consistent performance improvements across scene-text and document-based benchmarks.

Abstract

Vision-Language (VL) models have garnered considerable research interest; however, they still face challenges in effectively handling text within images. To address this limitation, researchers have developed two approaches. The first uses external Optical Character Recognition (OCR) tools to extract textual information from images, which is then prepended to the other textual inputs. The second employs extremely high-resolution images to improve text recognition capabilities. In this paper, we focus on enhancing the first approach by introducing a novel method, named TAP-VL, which treats OCR information as a distinct modality and seamlessly integrates it into any VL model. TAP-VL employs a lightweight transformer-based OCR module that receives OCR with layout information and compresses it into a short fixed-length sequence for input into the LLM. We first conduct model-agnostic pretraining of the OCR module on unlabeled documents, then integrate it into any VL architecture through brief fine-tuning. Extensive experiments demonstrate consistent performance improvements when applying TAP-VL to top-performing VL models, across scene-text and document-based VL benchmarks.
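To make the idea of compressing variable-length OCR input into a short fixed-length sequence concrete, the following is a minimal sketch, not the authors' implementation. It assumes the compression is done by a small set of learnable queries that cross-attend to the OCR tokens (the figure captions mention learnable queries, but the exact architecture is not given here). The names `embed_ocr_token` and `compress`, the toy embedding scheme, and the sizes `NUM_QUERIES` and `DIM` are all hypothetical, and the random query values stand in for trained parameters.

```python
import math
import random

NUM_QUERIES = 4  # hypothetical fixed output length
DIM = 8          # hypothetical embedding width


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def embed_ocr_token(word, box, dim=DIM):
    """Toy embedding (assumption): a word-seeded random vector plus the
    normalized bounding box (x0, y0, x1, y1) added as a crude layout signal."""
    rng = random.Random(word)  # string seeds are deterministic across runs
    vec = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for i, coord in enumerate(box):
        vec[i % dim] += coord
    return vec


def compress(ocr_embeddings, queries):
    """Single-head cross-attention: each query attends over all OCR tokens,
    so the output length equals the number of queries, not the OCR length."""
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in ocr_embeddings]
        weights = softmax(scores)
        outputs.append([sum(w * k[d] for w, k in zip(weights, ocr_embeddings))
                        for d in range(dim)])
    return outputs


# Random values stand in for learned query parameters.
rng = random.Random(0)
queries = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

short_doc = [("Total", (0.10, 0.80, 0.30, 0.85)), ("$42", (0.70, 0.80, 0.80, 0.85))]
long_doc = short_doc + [("Invoice", (0.10, 0.05, 0.40, 0.10))] * 10

short_out = compress([embed_ocr_token(w, b) for w, b in short_doc], queries)
long_out = compress([embed_ocr_token(w, b) for w, b in long_doc], queries)
print(len(short_out), len(long_out))
```

Both documents produce the same number of output vectors even though one has 2 OCR tokens and the other 12, which is the property that keeps the LLM input short regardless of how much text the image contains.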

Paper Structure

This paper contains 20 sections, 8 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Layout-aware OCR Adapter. (a) Previous methods extract OCR data and input it into the LLM as plain text. (b) TAP-VL introduces a plug-and-play OCR adapter that leverages layout information and can be seamlessly integrated with any vision-language LLM (VLLM).
  • Figure 2: TAP-VL Approach. (a) Our model-agnostic layout-aware pretraining framework for creating condensed rich OCR embeddings conditioned on text. (b) TAP-VL fully integrated, enhancing any VL model on OCR-oriented tasks.
  • Figure 3: OCR-Grounded Mask Denoising. Denoising pretraining mechanism where the OCR-Q predicts the masked words, enhancing comprehension of unmasked text semantics and layout information.
  • Figure 4: OCR-Mask Contrastive Learning. Contrastive learning task that aligns the representation of the learnable queries (which interact with the masked OCR) with the representation of the masked words.
  • Figure 5: OCR-Mask Matching. Visualization of the matching phase, aimed at enabling the OCR-Q to determine the correspondence between masked OCR content and the masked words.
  • ...and 4 more figures