OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models

Wenwen Yu, Zhibo Yang, Jianqiang Wan, Sibo Song, Jun Tang, Wenqing Cheng, Yuliang Liu, Xiang Bai

TL;DR

OmniParser V2 presents a universal, end-to-end framework for visually-situated text parsing (VsTP) that unifies text spotting, key information extraction (KIE), table recognition, and layout analysis under a single encoder-decoder with a two-stage Structured-Points-of-Thought (SPOT) prompting scheme. SPOT decouples structure learning from content and region prediction via a token-router-based shared decoder, improving performance, efficiency, and interpretability across diverse tasks. The approach is also extended to multimodal large language models (MLLMs), where SPOT prompting enhances text localization and recognition, demonstrating generality beyond the core VsTP tasks. Empirical results on standard benchmarks show state-of-the-art or competitive performance across all four tasks, with a substantially smaller model than the conference version and ablations that validate the design choices. The work moves toward a generalized, unified framework for document understanding and motivates further development of native text perception in MLLMs.

Abstract

Visually-situated text parsing (VsTP) has recently seen notable advancements, driven by the growing demand for automated document understanding and the emergence of large language models capable of processing document-based questions. While various methods have been proposed to tackle the complexities of VsTP, existing solutions often rely on task-specific architectures and objectives for individual tasks. This leads to modal isolation and complex workflows due to the diversified targets and heterogeneous schemas. In this paper, we introduce OmniParser V2, a universal model that unifies typical VsTP tasks, including text spotting, key information extraction, table recognition, and layout analysis, into a single framework. Central to our approach is the proposed Structured-Points-of-Thought (SPOT) prompting schema, which improves model performance across diverse scenarios by leveraging a unified encoder-decoder architecture, objective, and input and output representation. SPOT eliminates the need for task-specific architectures and loss functions, significantly simplifying the processing pipeline. Our extensive evaluations across four tasks on eight different datasets show that OmniParser V2 achieves state-of-the-art or competitive results in VsTP. Additionally, we explore the integration of SPOT within a multimodal large language model structure, further enhancing text localization and recognition capabilities, thereby confirming the generality of the SPOT prompting technique. The code is available at AdvancedLiterateMachinery (https://github.com/AlibabaResearch/AdvancedLiterateMachinery).

Paper Structure

This paper contains 23 sections, 1 equation, 7 figures, 10 tables.

Figures (7)

  • Figure 1: A task-agnostic architecture for visually-situated text parsing. The proposed OmniParser V2 takes an image and a task-specific structured-points-of-thought prompt as input and generates structured text sequences tailored to the specified task, including text spotting, key information extraction, table recognition, and layout analysis.
  • Figure 2: Schematic illustration of the proposed OmniParser V2 framework. The token-router-based shared decoder homogenizes four tasks through a unified structured points representation without task-specific branches. Furthermore, because point prediction is decoupled from content recognition and region prediction, the token-router-based shared decoder can generate the polygonal contour and the text content in parallel given the text points. SPS is short for structured points sequence.
  • Figure 3: Illustration of the token-router-based shared decoder. Note that Add and LayerNorm layers are omitted for clarity. The Structured FFN, Detection FFN, and Recognition FFN have separate parameters, while all other modules within the shared decoder share the same parameters. Input tokens from different categories are routed through their corresponding class-specific FFNs. In the figure, different colors indicate the mapping between input tokens and their respective FFNs.
  • Figure 4: Spatial-Window Prompting uses a 2-point prompt, denoted as $(x_{\texttt{left}}, y_{\texttt{top}}, x_{\texttt{right}}, y_{\texttt{bottom}})$, to specify the location of the prompting spatial window. Prefix-Window Prompting uses a 2-character prompt indicating the starting and ending characters of the prefix window within the entire dictionary. The selected prefix range is highlighted in black, while the others are shaded in gray. The outputs are the center points of two words, "Colchester" and "Greenstead", because their first letters, "C" and "G", fall within the predefined prefix range ["B", "H"]; the word "and" is excluded because its first letter falls outside this range.
  • Figure 5: Illustration of the proposed structured-points-of-thought prompting applied to an existing multimodal large language model (MLLM) pipeline. In the first conversation, the original image is combined with Instruction 1 as a prompt, guiding the MLLM to generate a structured points sequence (Response 1), which represents the center points of each text instance in reading order for the text spotting task. In the second conversation, the first conversation is used as context, and Instruction 2 prompts the MLLM to generate both the location coordinates and text content corresponding to each center point within the structured points sequence (Response 2). Finally, the extracted information is formatted to produce the expected text spotting results.
  • ...and 2 more figures
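
The selection rule described in the Figure 4 caption can be illustrated with a small filter. This is only a sketch of which center points a window prompt is meant to select: in the model itself the prompt conditions the decoder rather than filtering ground truth. The names `TextInstance` and `select_center_points`, the pixel coordinates, and the case-insensitive letter comparison are all assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Window = Tuple[float, float, float, float]  # (x_left, y_top, x_right, y_bottom)


@dataclass(frozen=True)
class TextInstance:
    """Hypothetical container: a word and the center point of its region."""
    text: str
    center: Point


def in_spatial_window(inst: TextInstance, window: Window) -> bool:
    """True if the center point lies inside the 2-point spatial window."""
    x_left, y_top, x_right, y_bottom = window
    x, y = inst.center
    return x_left <= x <= x_right and y_top <= y <= y_bottom


def in_prefix_window(inst: TextInstance, prefix_range: Tuple[str, str]) -> bool:
    """True if the first character falls within [start, end].

    Assumption: comparison is case-insensitive and follows plain
    alphabetical order; the real dictionary ordering may differ.
    """
    if not inst.text:
        return False
    start, end = prefix_range
    first = inst.text[0].upper()
    return start.upper() <= first <= end.upper()


def select_center_points(
    instances: List[TextInstance],
    spatial_window: Optional[Window] = None,
    prefix_range: Optional[Tuple[str, str]] = None,
) -> List[Point]:
    """Keep the center points that satisfy every window that is given."""
    selected = []
    for inst in instances:
        if spatial_window is not None and not in_spatial_window(inst, spatial_window):
            continue
        if prefix_range is not None and not in_prefix_window(inst, prefix_range):
            continue
        selected.append(inst.center)
    return selected


# Words from the Figure 4 caption, with made-up coordinates.
words = [
    TextInstance("Colchester", (120.0, 40.0)),
    TextInstance("and", (215.0, 41.0)),
    TextInstance("Greenstead", (310.0, 42.0)),
]
prefix_selected = select_center_points(words, prefix_range=("B", "H"))
spatial_selected = select_center_points(words, spatial_window=(0.0, 0.0, 250.0, 100.0))
```

With the prefix range ["B", "H"], the sketch keeps the points for "Colchester" and "Greenstead" and drops "and", matching the example in the caption.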
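
The routing idea in the Figure 3 caption, where each token passes through the FFN for its category while the rest of the decoder is shared, might be sketched as follows. This is a toy, pure-Python illustration under stated assumptions, not the authors' implementation: the class `TokenRouter`, the helper `make_ffn`, the category labels, the layer sizes, and the ReLU activation are all hypothetical, and the shared attention layers are omitted entirely.

```python
import random
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def make_ffn(d_model: int, d_hidden: int, seed: int) -> Callable[[Vector], Vector]:
    """Build a tiny two-layer feed-forward network with its own weights."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]

    def ffn(x: Vector) -> Vector:
        # Linear -> ReLU -> Linear (activation choice is an assumption).
        hidden = [
            max(0.0, sum(x[i] * w1[i][j] for i in range(d_model)))
            for j in range(d_hidden)
        ]
        return [
            sum(hidden[j] * w2[j][k] for j in range(d_hidden))
            for k in range(d_model)
        ]

    return ffn


class TokenRouter:
    """Dispatch each token to the FFN that matches its category."""

    def __init__(self, d_model: int = 4, d_hidden: int = 8) -> None:
        # Three FFNs with separate parameters, one per token category.
        self.ffns: Dict[str, Callable[[Vector], Vector]] = {
            "structure": make_ffn(d_model, d_hidden, seed=0),
            "detection": make_ffn(d_model, d_hidden, seed=1),
            "recognition": make_ffn(d_model, d_hidden, seed=2),
        }

    def __call__(self, tokens: Sequence[Vector], token_types: Sequence[str]) -> List[Vector]:
        if len(tokens) != len(token_types):
            raise ValueError("tokens and token_types must have the same length")
        unknown = set(token_types) - set(self.ffns)
        if unknown:
            raise ValueError(f"unknown token types: {sorted(unknown)}")
        return [self.ffns[t](x) for x, t in zip(tokens, token_types)]


router = TokenRouter()
token = [1.0, 0.5, -0.3, 2.0]
routed = router([token, token, token], ["structure", "detection", "recognition"])
```

Because routing is a per-token lookup, a mixed sequence of structure, detection, and recognition tokens can pass through one decoder layer without task-specific branches elsewhere in the network.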