VisFocus: Prompt-Guided Vision Encoders for OCR-Free Dense Document Understanding

Ofir Abramovich, Niv Nayman, Sharon Fogel, Inbal Lavi, Ron Litman, Shahar Tsiper, Royee Tichauer, Srikar Appalaraju, Shai Mazor, R. Manmatha

TL;DR

VisFocus tackles the inefficiency of OCR-free dense document understanding by injecting the user prompt into the vision encoder, producing prompt-aware features $\hat{Z}_{\mathbf{p}}$ through Vision-Language Merging Attention (ViLMA) and guiding focus with a Localized Masked Prompt Modeling (LMPM) pre-training scheme. The method combines architectural changes to Swin transformers with a three-stage training scheme (LtR pre-training, LMPM pre-training, and fine-tuning) so that the model learns when and where to read text relevant to a given prompt. Empirical results across five document VQA benchmarks show consistent improvements over prior OCR-free approaches for small and base models. Ablations confirm that ViLMA and LMPM are complementary and show that the gains depend on document density. Overall, VisFocus demonstrates that prompt-aware visual encoding can achieve state-of-the-art performance on dense documents while remaining efficient, and it opens avenues for further prompt-guided pre-training across diverse document modalities.

Abstract

In recent years, notable advancements have been made in the domain of visual document understanding, with the prevailing architecture comprising a cascade of vision and language models. The text component can either be extracted explicitly with the use of external OCR models in OCR-based approaches, or alternatively, the vision model can be endowed with reading capabilities in OCR-free approaches. Typically, the queries to the model are input exclusively to the language component, necessitating the visual features to encompass the entire document. In this paper, we present VisFocus, an OCR-free method designed to better exploit the vision encoder's capacity by coupling it directly with the language prompt. To do so, we replace the down-sampling layers with layers that receive the input prompt and allow highlighting relevant parts of the document, while disregarding others. We pair the architecture enhancements with a novel pre-training task, using language masking on a snippet of the document text fed to the visual encoder in place of the prompt, to empower the model with focusing capabilities. Consequently, VisFocus learns to allocate its attention to text patches pertinent to the provided prompt. Our experiments demonstrate that this prompt-guided visual encoding approach significantly improves performance, achieving state-of-the-art results on various benchmarks.
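
The prompt-guided down-sampling idea described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes single-head cross attention in which visual tokens act as queries and encoded prompt tokens act as keys and values, a residual connection, and a Swin-style 2x2 patch merge afterwards. The function names (`cross_attention`, `patch_merge`, `prompt_guided_downsample`), the parameter dictionary, and all shapes are hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(visual, prompt, w_q, w_k, w_v):
    """Single-head cross attention (assumed roles: visual = queries,
    prompt = keys/values). visual: (N, d), prompt: (M, d)."""
    q = visual @ w_q
    k = prompt @ w_k
    v = prompt @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, M)
    # Residual connection is an assumption of this sketch.
    return visual + attn @ v, attn


def patch_merge(tokens, height, width, w_reduce):
    """Swin-style 2x2 merge: concatenate each 2x2 neighbourhood
    (4*d channels), then project with w_reduce."""
    d = tokens.shape[-1]
    grid = tokens.reshape(height, width, d)
    merged = np.concatenate(
        [grid[0::2, 0::2], grid[1::2, 0::2], grid[0::2, 1::2], grid[1::2, 1::2]],
        axis=-1,
    )
    return merged.reshape(-1, 4 * d) @ w_reduce


def prompt_guided_downsample(visual, prompt, height, width, params):
    # Let the prompt modulate the visual tokens first, then down-sample.
    fused, attn = cross_attention(
        visual, prompt, params["w_q"], params["w_k"], params["w_v"]
    )
    return patch_merge(fused, height, width, params["w_reduce"]), attn


# Toy sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
H, W, D, M = 4, 6, 8, 5
params = {
    "w_q": rng.normal(size=(D, D)) * 0.1,
    "w_k": rng.normal(size=(D, D)) * 0.1,
    "w_v": rng.normal(size=(D, D)) * 0.1,
    "w_reduce": rng.normal(size=(4 * D, 2 * D)) * 0.1,
}
visual = rng.normal(size=(H * W, D))   # flattened grid of patch features
prompt = rng.normal(size=(M, D))       # encoded prompt tokens
out, attn = prompt_guided_downsample(visual, prompt, H, W, params)
print(out.shape, attn.shape)
```

Each row of `attn` is a distribution over prompt tokens for one visual patch, so the merged features carry prompt information into the next stage. A real layer would likely add normalization, multiple heads, and learned weights.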

Paper Structure

This paper contains 38 sections, 8 equations, 20 figures, and 10 tables.

Figures (20)

  • Figure 1: VisFocus' key contributions. The left side of the figure illustrates how VisFocus enables the vision model to better align visual features with the input prompt: unlike previous approaches, VisFocus inputs the prompt not only to the language model but also to the vision encoder (top left vs. top middle). In addition, a novel pre-training task uses the resulting interactions with the prompt to focus the model on specific text patches (bottom middle) instead of the entire text (bottom left). The right side of the figure shows the resulting attention map from VisFocus, illustrating how the model focuses on a specific word taken from the query ('Nursing').
  • Figure 2: An overview of the VisFocus architecture. The encoded prompt serves as an input for every ViLMA layer, at the end of each encoding stage (top). The goal of the ViLMA layers is to provide the encoder with prompt guidance during the down-sampling process. The encoded prompt is input through a cross attention layer before down-sampling (bottom).
  • Figure 3: Training Scheme. Previous methods only trained the model to read by predicting the OCR of the document (Stage I). We suggest an additional Localized Masked Prompt Modeling step (Stage II) to train the model to focus on a specific area of text inside the document.
  • Figure 3: Prompt Insertion Methods. Inserting the prompt via ViLMA layers improves results compared to previous approaches when only LtR pre-training is applied (without LMPM). "Render" = the question is rendered on the document image.
  • Figure 4: Attention maps of the last ViLMA layer. Textual regions relevant to the question tokens are highly activated when LMPM pre-training is performed (top) compared to when this training stage is skipped (bottom). The model focuses its attention not only on the specific input word but also on related words; e.g., when performing cross attention with the word "diameter", it focuses on the words "under-ream" and "180 degrees".
  • ...and 15 more figures
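
The Localized Masked Prompt Modeling stage mentioned in the Training Scheme caption can be illustrated with a small data-preparation sketch. It assumes that a contiguous snippet of the document's text is sampled, that some of its words are replaced by a mask token, and that the masked snippet is used in place of the prompt while the hidden words become the prediction targets. The helper `make_lmpm_example`, the `<mask>` token, the snippet length, and the masking ratio are all hypothetical; the paper's actual masking strategy may differ (for example, it may mask spans rather than single words).

```python
import random

MASK = "<mask>"  # hypothetical mask token, not taken from the paper


def make_lmpm_example(words, snippet_len=8, mask_ratio=0.25, rng=None):
    """Build one masked-prompt training example from a document's words.

    Returns the snippet start index, the masked snippet (used as the
    prompt), and the hidden words (used as targets).
    """
    if not words:
        raise ValueError("words must be non-empty")
    rng = rng or random.Random()
    snippet_len = min(snippet_len, len(words))
    # Pick a local, contiguous snippet of the document text.
    start = rng.randrange(len(words) - snippet_len + 1)
    snippet = words[start:start + snippet_len]
    # Hide at least one word of the snippet.
    n_mask = max(1, round(mask_ratio * snippet_len))
    masked_positions = sorted(rng.sample(range(snippet_len), n_mask))
    prompt = [MASK if i in masked_positions else w for i, w in enumerate(snippet)]
    targets = [snippet[i] for i in masked_positions]
    return {"start": start, "prompt": prompt, "targets": targets,
            "positions": masked_positions}


# Toy document text, invented for the demonstration.
document_words = (
    "the casing diameter is measured after the under-ream tool "
    "rotates 180 degrees inside the well bore"
).split()
example = make_lmpm_example(document_words, rng=random.Random(0))
print(" ".join(example["prompt"]))
print(example["targets"])
```

Because the prompt is a local snippet rather than the whole page, recovering the hidden words requires locating that snippet in the image, which is the focusing behaviour the caption describes.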