OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

Jianqiang Wan, Sibo Song, Wenwen Yu, Yuliang Liu, Wenqing Cheng, Fei Huang, Xiang Bai, Cong Yao, Zhibo Yang

TL;DR

OmniParser is a universal model that simultaneously handles three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. Despite its unified, concise design, it achieves state-of-the-art or highly competitive performance on 7 datasets across these three tasks.

Abstract

Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However, due to the diversified targets and heterogeneous schemas, previous works usually design task-specific architectures and objectives for individual tasks, which inadvertently leads to modal isolation and complex workflow. In this paper, we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically, we devise a universal model, called OmniParser, which can simultaneously handle three typical visually-situated text parsing tasks: text spotting, key information extraction, and table recognition. In OmniParser, all tasks share the unified encoder-decoder architecture, the unified objective: point-conditioned text generation, and the unified input & output representation: prompt & structured sequences. Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks, despite its unified, concise design. The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery.

Paper Structure (14 sections, 1 equation, 4 figures, 7 tables)

Figures (4)

  • Figure 1: A task-agnostic architecture for visually-situated text parsing. The proposed OmniParser takes an image and a task-specific indicator as input and generates structured text sequences tailored to the specified task, including text spotting, key information extraction, and table recognition.
  • Figure 2: Schematic illustration of the proposed OmniParser framework. Structured Points Decoder homogenizes three tasks through a unified structural points representation without designing task-specific branches. Furthermore, benefiting from decoupling points with content recognition and region prediction, the Region Decoder and Content Decoder can generate polygonal contour and text content in parallel given the text points.
  • Figure 3: Spatial-Window Prompting utilizes a 2-point prompt, denoted as $(x_{\texttt{left}}, y_{\texttt{top}}, x_{\texttt{right}}, y_{\texttt{bottom}})$, which specifies the location of the prompting spatial window. Prefix-Window Prompting employs a 2-character prompt that indicates the starting and ending characters of the prefix window within the entire dictionary. The selected prefix range is highlighted in black, while the other characters are shaded in gray. The outputs comprise the center points of two words, "Harwich" and "Clacton", as their prefixes "H" and "C" fall within the predefined prefix range.
  • Figure 4: Qualitative results of text spotting (columns 1-2), KIE (column 3), and table recognition (column 4). For KIE, points, polygons, and recognition results are visualized. The color assigned to each polygon indicates the entity type. For table recognition, we present point locations and a rendered table based on the prediction sequence, with an additional border for readability. Blue points and red points denote the GT and predicted points, respectively. More details can be found in the supplementary material. (The figure is best viewed in color.)
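
The two prompts described in the Figure 3 caption can be read as a pair of filters over the words in an image. The Python sketch below illustrates only that selection rule, not the OmniParser model itself, in which the prompts condition the decoder's generation. The `Word` class, the function names, the `DICTIONARY` character ordering, the coordinates, and the example window and prefix range are all hypothetical and chosen for illustration.

```python
import string
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A word with the (x, y) coordinates of its center point."""
    text: str
    cx: float
    cy: float


# Assumed character ordering; the dictionary used in the paper may differ.
DICTIONARY = string.ascii_uppercase + string.ascii_lowercase


def in_spatial_window(word, window):
    """True if the word's center lies inside (x_left, y_top, x_right, y_bottom)."""
    x_left, y_top, x_right, y_bottom = window
    return x_left <= word.cx <= x_right and y_top <= word.cy <= y_bottom


def in_prefix_window(word, prefix_range, dictionary=DICTIONARY):
    """True if the word's first character lies between the two prompt characters."""
    if not word.text or word.text[0] not in dictionary:
        return False
    start, end = prefix_range
    low, high = dictionary.index(start), dictionary.index(end)
    return low <= dictionary.index(word.text[0]) <= high


def select_center_points(words, window, prefix_range, dictionary=DICTIONARY):
    """Return (text, center point) for every word that passes both filters."""
    return [
        (w.text, (w.cx, w.cy))
        for w in words
        if in_spatial_window(w, window) and in_prefix_window(w, prefix_range, dictionary)
    ]


# Made-up normalized coordinates, for illustration only.
WORDS = [
    Word("Harwich", 0.30, 0.40),
    Word("Clacton", 0.60, 0.50),
    Word("Station", 0.50, 0.50),     # inside the window, but "S" is outside A..M
    Word("Colchester", 0.95, 0.90),  # "C" is in range, but the center is outside the window
]

if __name__ == "__main__":
    for text, point in select_center_points(WORDS, (0.1, 0.1, 0.8, 0.8), ("A", "M")):
        print(text, point)
```

With these made-up inputs, only "Harwich" and "Clacton" pass both filters, mirroring the example in the caption.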