VISTA-OCR: Towards generative and interactive end to end OCR models

Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet

TL;DR

VISTA-OCR tackles the need for end-to-end OCR that is both text- and layout-aware while avoiding the computational burden of large vision-language models. It achieves this with a lightweight encoder–decoder where a Transformer decoder jointly generates text tokens and spatial coordinates in a single sequence, trained progressively with multitask prompts that enable region-based and content-based localization. The authors introduce real and synthetic line-annotated datasets to support pretraining and evaluation across printed and handwritten documents. Across SROIE, IAM, RIMES, MAURDOR, and synthetic benchmarks, VISTA-ft often achieves state-of-the-art or competitive results on text recognition and detection, while VISTA-omni demonstrates notable cross-domain generalization with roughly 150M parameters. This approach paves the way for interactive OCR applications that are efficient, scalable, and adaptable to diverse document understanding tasks.

Abstract

We introduce VISTA-OCR (Vision and Spatially-aware Text Analysis OCR), a lightweight architecture that unifies text detection and recognition within a single generative model. Unlike conventional methods that require separate branches with dedicated parameters for text recognition and detection, our approach leverages a Transformer decoder to sequentially generate text transcriptions and their spatial coordinates in a unified branch. Built on an encoder-decoder architecture, VISTA-OCR is progressively trained, starting with the visual feature extraction phase, followed by multitask learning with multimodal token generation. To address the increasing demand for versatile OCR systems capable of advanced tasks, such as content-based text localization, we introduce new prompt-controllable OCR tasks during pre-training. To enhance the model's capabilities, we built a new dataset composed of real-world examples enriched with bounding box annotations and synthetic samples. Although recent Vision Large Language Models (VLLMs) can efficiently perform these tasks, their high computational cost remains a barrier for practical deployment. In contrast, our VISTA$_{\text{omni}}$ variant processes both handwritten and printed documents with only 150M parameters, interactively, by prompting. Extensive experiments on multiple datasets demonstrate that VISTA-OCR achieves better performance compared to state-of-the-art specialized models on standard OCR tasks while showing strong potential for more sophisticated OCR applications, addressing the growing need for interactive OCR systems. All code and annotations for VISTA-OCR will be made publicly available upon acceptance.
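To make the idea of a single output sequence concrete, the sketch below shows one way a target sequence could interleave text with location tokens, with each line delimited by tokens for the upper and lower edges of its bounding box. It is only an illustration, not the authors' implementation: the `<loc_k>` token format, the `NUM_BINS` quantization resolution, and the names `Line`, `loc_token`, and `serialize` are all assumptions introduced here.

```python
from dataclasses import dataclass

# Assumed quantization resolution; the paper's actual value is not given here.
NUM_BINS = 100


@dataclass
class Line:
    text: str
    top: int     # upper edge of the line's bounding box, in pixels
    bottom: int  # lower edge of the line's bounding box, in pixels


def loc_token(y: int, page_height: int) -> str:
    """Quantize a vertical pixel position into a hypothetical location token."""
    bin_index = min(y * NUM_BINS // page_height, NUM_BINS - 1)
    return f"<loc_{bin_index}>"


def serialize(lines: list[Line], page_height: int) -> str:
    """Build one sequence in reading order: <top> text <bottom> per line."""
    parts = []
    for line in sorted(lines, key=lambda item: item.top):
        parts.append(loc_token(line.top, page_height))
        parts.append(line.text)
        parts.append(loc_token(line.bottom, page_height))
    return "".join(parts)


lines = [Line("Total: 12.50", 400, 440), Line("RECEIPT", 50, 100)]
sequence = serialize(lines, page_height=1000)
print(sequence)
# <loc_5>RECEIPT<loc_10><loc_40>Total: 12.50<loc_44>
```

Because text and positions share one vocabulary and one sequence, a single decoder can be trained with ordinary next-token prediction, which is the property the abstract contrasts with separate detection and recognition branches.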

Paper Structure

This paper contains 35 sections, 6 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: Synthetic image with the corresponding OCR and location transcription. Each line transcription is delimited by the spatial tokens that encode the upper (resp. lower) position of its bounding box.
  • Figure 2: The overall architecture consists of a CNN vision encoder and a Transformer decoder that takes the visual features and a prompt and sequentially outputs the textual and location tokens.
  • Figure 3: Samples from real datasets enriched with text-line-level annotations
  • Figure 4: Synthetic samples
  • Figure 5: Two-dimensional t-SNE representations of the location token embeddings
  • ...and 2 more figures