Platypus: A Generalized Specialist Model for Reading Text in Various Forms

Peng Wang, Zhaohai Li, Jun Tang, Humen Zhong, Fei Huang, Zhibo Yang, Cong Yao

TL;DR

This paper addresses the challenge of reading text from images across diverse forms (natural scenes, documents, handwritten text, and formulas) by introducing Platypus, a generalized specialist model that unifies recognition tasks under a single architecture. The approach combines a Swin-B-based Image Encoder, a Prompt Encoder for task prompts, and an autoregressive 6-layer, 8-head Recognition Decoder to support RAT, PPR, and BPR in a single pass, and it is trained with a two-phase regime and a four-term loss L_total. The authors also construct the Worms dataset to train and evaluate Platypus across text-reading tasks. Experiments on STS, scene text recognition (STR), handwritten text recognition (HTR), and mathematical expression recognition (MER) show state-of-the-art or competitive accuracy, together with higher efficiency than both specialist models and MLLMs. The work highlights the effectiveness of prompt-guided, multi-task learning for reading text across formats, and it points to multilingual expansion and broader real-world deployment as future work.
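
As a rough illustration of the prompt-guided design summarized above, the following Python sketch shows one way a single model could be driven by three kinds of task prompt. Only the task names (RAT, PPR, BPR) and the decoder size (6 layers, 8 heads) come from the summary. Everything else is an assumption made for illustration: the reading of RAT as "recognize all text", PPR as a point prompt, and BPR as a box prompt, as well as the hypothetical names DecoderConfig, Prompt, and make_prompt. This is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Task names come from the TL;DR. Their interpretation below is an assumption.
TASKS = ("RAT", "PPR", "BPR")


@dataclass(frozen=True)
class DecoderConfig:
    """Hypothetical container for the decoder size stated in the TL;DR."""
    num_layers: int = 6  # "6-layer"
    num_heads: int = 8   # "8-head"


@dataclass(frozen=True)
class Prompt:
    """Hypothetical task prompt handed to a prompt encoder."""
    task: str
    point: Optional[Tuple[float, float]] = None              # assumed (x, y)
    box: Optional[Tuple[float, float, float, float]] = None  # assumed (x1, y1, x2, y2)


def make_prompt(task, point=None, box=None):
    """Validate the task/geometry combination and build a Prompt.

    Assumed semantics: RAT takes no geometry, PPR takes exactly a point,
    and BPR takes exactly a box.
    """
    if task not in TASKS:
        raise ValueError(f"unknown task: {task!r}")
    if task == "RAT" and (point is not None or box is not None):
        raise ValueError("RAT takes no point or box")
    if task == "PPR" and (point is None or box is not None):
        raise ValueError("PPR needs a point and no box")
    if task == "BPR" and (box is None or point is not None):
        raise ValueError("BPR needs a box and no point")
    return Prompt(task=task, point=point, box=box)


# Example: three prompts that one shared model could receive.
prompts = [
    make_prompt("RAT"),
    make_prompt("PPR", point=(0.5, 0.25)),
    make_prompt("BPR", box=(0.1, 0.2, 0.6, 0.4)),
]
```

The point of the sketch is only that the task is selected by the prompt rather than by swapping models; how Platypus actually encodes prompts is described in the paper, not here.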

Abstract

Reading text from images (either natural scenes or documents) has been a long-standing research topic for decades, due to the high technical challenge and wide application range. Previously, individual specialist models were developed to tackle the sub-tasks of text reading (e.g., scene text recognition, handwritten text recognition, and mathematical expression recognition). However, such specialist models usually cannot generalize effectively across different sub-tasks. Recently, generalist models (such as GPT-4V), trained on massive amounts of data in a unified way, have shown enormous potential in reading text in various scenarios, but they suffer from limited accuracy and low efficiency. In this work, we propose Platypus, a generalized specialist model for text reading. Specifically, Platypus combines the best of both worlds: it can recognize text of various forms with a single unified architecture while achieving excellent accuracy and high efficiency. To better exploit the advantages of Platypus, we also construct a text reading dataset (called Worms), the images of which are curated from previous datasets and partially re-labeled. Experiments on standard benchmarks demonstrate the effectiveness and superiority of the proposed Platypus model. Model and data will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/Platypus.

Paper Structure (39 sections, 2 equations, 4 figures, 11 tables)

This paper contains 39 sections, 2 equations, 4 figures, 11 tables.

Figures (4)

  • Figure 1: Comparative cases of Platypus against MLLMs (GPT-4V, Qwen-VL-Plus) and OCR tools (PaddleOCR, EasyOCR) on the CAT Benchmark, highlighting the word accuracy ratio (red brackets) and Platypus's RAT performance.
  • Figure 2: Comparison of our Platypus with previous OCR systems.
  • Figure 3: Overall architecture of our proposed Platypus model.
  • Figure 4: Qualitative results of Platypus. All images are derived from public datasets.