Logics-Parsing-Omni Technical Report

Xin An; Jingyi Cai; Xiangyang Chen; Huayao Liu; Peiting Liu; Peng Wang; Bei Yang; Xiuwen Zhu; Yongfan Chen; Baoyu Hou; Shuzhao Li; Weidong Ren; Fan Yang; Jiangtao Zhang; Xiaoxiao Xu; Lin Qu

TL;DR

The Omni Parsing framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, and introduces a progressive parsing paradigm that bridges perception and cognition. The authors also construct a standardized dataset and release the Logics-Parsing-Omni model, which converts complex audio-visual signals into machine-readable structured knowledge.

Abstract

Addressing the challenges of fragmented task definitions and the heterogeneity of unstructured data in multimodal parsing, this paper proposes the Omni Parsing framework. This framework establishes a Unified Taxonomy covering documents, images, and audio-visual streams, introducing a progressive parsing paradigm that bridges perception and cognition. Specifically, the framework integrates three hierarchical levels: 1) Holistic Detection, which achieves precise spatial-temporal grounding of objects or events to establish a geometric baseline for perception; 2) Fine-grained Recognition, which performs symbolization (e.g., OCR/ASR) and attribute extraction on localized objects to complete structured entity parsing; and 3) Multi-level Interpreting, which constructs a reasoning chain from local semantics to global logic. A pivotal advantage of this framework is its evidence anchoring mechanism, which enforces a strict alignment between high-level semantic descriptions and low-level facts. This enables "evidence-based" logical induction, transforming unstructured signals into standardized knowledge that is locatable, enumerable, and traceable. Building on this foundation, we constructed a standardized dataset and released the Logics-Parsing-Omni model, which successfully converts complex audio-visual signals into machine-readable structured knowledge. Experiments demonstrate that fine-grained perception and high-level cognition are synergistic, effectively enhancing model reliability. Furthermore, to quantitatively evaluate these capabilities, we introduce OmniParsingBench. Code, models and the benchmark are released at https://github.com/alibaba/Logics-Parsing/tree/master/Logics-Parsing-Omni.
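
To make the three levels and the evidence anchoring idea concrete, the following is a minimal sketch of what a parsed record could look like. It is not the schema used by Logics-Parsing-Omni: every class name, field name, and the `unanchored` check are hypothetical, chosen only to illustrate how Level 3 statements might be tied back to Level 1 detections and serialized as JSON.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """Level 1 (hypothetical): a grounded object or event."""
    id: str
    label: str
    bbox: Optional[Tuple[float, float, float, float]] = None  # spatial grounding
    time_span: Optional[Tuple[float, float]] = None           # temporal grounding, seconds


@dataclass
class Recognition:
    """Level 2 (hypothetical): symbolized content or an attribute of one detection."""
    detection_id: str
    kind: str   # e.g. "ocr", "asr", "attribute"
    value: str


@dataclass
class Interpretation:
    """Level 3 (hypothetical): a semantic statement plus the detections it cites."""
    text: str
    evidence_ids: List[str] = field(default_factory=list)


@dataclass
class ParseResult:
    detections: List[Detection] = field(default_factory=list)
    recognitions: List[Recognition] = field(default_factory=list)
    interpretations: List[Interpretation] = field(default_factory=list)

    def unanchored(self) -> List[str]:
        """Return statements that cite no evidence or cite unknown detection ids."""
        known = {d.id for d in self.detections}
        return [
            i.text
            for i in self.interpretations
            if not i.evidence_ids or not set(i.evidence_ids) <= known
        ]

    def to_json(self) -> str:
        """Serialize all three levels into one JSON document."""
        return json.dumps(asdict(self), indent=2)


# Illustrative data only; none of these values come from the paper.
result = ParseResult(
    detections=[
        Detection(id="d1", label="slide_title",
                  bbox=(0.1, 0.05, 0.9, 0.15), time_span=(0.0, 4.2)),
    ],
    recognitions=[
        Recognition(detection_id="d1", kind="ocr", value="Quarterly Results"),
    ],
    interpretations=[
        Interpretation(text="The video opens with a results slide.",
                       evidence_ids=["d1"]),
        Interpretation(text="The speaker sounds optimistic.",
                       evidence_ids=["d9"]),  # cites a detection that does not exist
    ],
)

print(result.unanchored())
print(result.to_json())
```

In this sketch, a statement counts as anchored only if every id it cites resolves to a Level 1 detection, which is one simple way to read the "locatable, enumerable, and traceable" property described above.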

Paper Structure (41 sections, 17 figures, 18 tables)

This paper contains 41 sections, 17 figures, 18 tables.

Introduction
Methodology
Overview
Dataset
Image
Natural Images.
Graphics.
Document
Audio
Video
Natural Videos.
Text-Rich Videos.
Training
Evaluation
OmniParsingBench
...and 26 more sections

Figures (17)

Figure 1: OmniParsingBench performance of Logics-Parsing-Omni.
Figure 2: Showcase of the multifaceted capabilities of Logics-Parsing-Omni.
Figure 3: The construction of unified multi-modal parsing corpus and training pipeline of our proposed Logics-Parsing-Omni.
Figure 4: Overview of the Omni Parsing Framework. The framework transforms multimodal raw data into unified structured training data via three progressive stages: (1) L1-Holistic Detection: Performs spatio-temporal grounding and coarse classification; (2) L2-Fine-grained Recognition: Extracts detailed text, symbols, knowledge, attributes, and speech content; (3) L3-Multi-level Interpreting: Synthesizes local semantics with global logical reasoning. The final output is a standardized JSON format containing all parsing results from L1 to L3.
Figure 5: Qualitative examples illustrating the comprehensive audio parsing capability of Logics-Parsing-Omni.
...and 12 more figures
