Uni-Parser Technical Report

Xi Fang, Haoyi Tao, Shuwen Yang, Suyang Zhong, Haocheng Lu, Han Lyu, Chaozheng Huang, Xinyu Li, Linfeng Zhang, Guolin Ke

TL;DR

Uni-Parser addresses the challenge of industrial-scale parsing of scientific PDFs and patents by deploying a modular, multi-expert architecture that preserves cross-modal alignments across text, formulas, tables, figures, and chemical structures. It introduces a group-based layout detection framework, modality-specific modules (OCR, table, formula, chemical, chart), and a distributed, pipeline-parallel infrastructure to achieve high throughput at scale. Key contributions include Uni-Parser-LD for layout, SLANet for table structures, MolParser 1.5 for chemical structure recognition, SciParser for figure captions, and a data-flywheel engine with Uni-Miner for human-in-the-loop data curation. The system scales to billions of pages and enables downstream AI4Science tasks, including large-scale data generation for foundation models and robust domain-specific knowledge bases.

Abstract

This technical report introduces Uni-Parser, an industrial-grade document parsing engine tailored for scientific literature and patents, delivering high throughput, robust accuracy, and cost efficiency. Unlike pipeline-based document parsing methods, Uni-Parser employs a modular, loosely coupled multi-expert architecture that preserves fine-grained cross-modal alignments across text, equations, tables, figures, and chemical structures, while remaining easily extensible to emerging modalities. The system incorporates adaptive GPU load balancing, distributed inference, dynamic module orchestration, and configurable modes that support either holistic or modality-specific parsing. Optimized for large-scale cloud deployment, Uni-Parser achieves a processing rate of up to 20 PDF pages per second on 8 x NVIDIA RTX 4090D GPUs, enabling cost-efficient inference across billions of pages. This level of scalability facilitates a broad spectrum of downstream applications, ranging from literature retrieval and summarization to the extraction of chemical structures, reaction schemes, and bioactivity data, as well as the curation of large-scale corpora for training next-generation large language models and AI4Science models.

Paper Structure

This paper contains 29 sections, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Sketch of the Uni-Parser pipeline. Uni-Parser converts unstructured PDFs into clean, hierarchical, multimodal outputs (text, formulas, tables, figures, and chemical structures). These enriched representations are designed to be readily consumed by large language models, enabling more accurate understanding, reasoning, and document-level operations.
  • Figure 2: An inference example of group-based layout detection using Uni-Parser-LD. The final output is a hierarchical layout tree structure.
  • Figure 3: An example of the OCR model inference workflow in Uni-Parser. When a top-layer layout element overlaps a bottom-layer layout block, the system substitutes it with a placeholder before performing OCR. The placeholder is then resolved during post-processing, enabling fast and accurate multimodal parsing.
  • Figure 4: An example of table structure recognition results produced by Uni-Parser. By decoupling table structure recognition from table content recognition, the system achieves improved robustness and supports multimodal nesting within tables.
  • Figure 5: Examples of multi-modal recognition results produced by Uni-Parser. (a) Molecular structures are correctly associated with their corresponding identifiers. (b) Mathematical formulas are accurately linked to their formula IDs. (c) An organic chemical reaction is parsed into a structured reactant–condition–product triplet.
  • ...and 6 more figures
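
The placeholder mechanism described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the Uni-Parser implementation: the names `Box`, `Element`, `substitute`, and `resolve`, the `<ph_N>` token format, and the word-level simulation of OCR output are all assumptions made for illustration. The sketch only shows the general idea of replacing text that sits under a top-layer element (here, an inline formula) with a placeholder, then resolving that placeholder with the expert module's output during post-processing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    def overlaps(self, other: "Box") -> bool:
        # Strict overlap test for axis-aligned rectangles.
        return (self.x0 < other.x1 and other.x0 < self.x1
                and self.y0 < other.y1 and other.y0 < self.y1)


@dataclass
class Element:
    box: Box
    kind: str      # e.g. "formula", "molecule" (hypothetical labels)
    content: str   # output of the modality-specific expert (assumed given)


def substitute(words, elements):
    """Replace words covered by a top-layer element with one placeholder.

    words: list of (Box, str) in reading order, standing in for raw OCR.
    Returns (tokens, mapping) where mapping is placeholder -> Element.
    """
    tokens, mapping, placed = [], {}, {}
    for box, text in words:
        hit = next((i for i, el in enumerate(elements)
                    if box.overlaps(el.box)), None)
        if hit is None:
            tokens.append(text)
        elif hit not in placed:
            token = f"<ph_{len(placed)}>"   # hypothetical token format
            placed[hit] = token
            mapping[token] = elements[hit]
            tokens.append(token)
        # Otherwise the word lies under an element that already has a
        # placeholder in the token stream, so it is dropped.
    return tokens, mapping


def resolve(text, mapping):
    """Post-processing: swap each placeholder for the expert's output."""
    for token, element in mapping.items():
        text = text.replace(token, element.content)
    return text


# Toy bottom-layer text block: the inline formula would be garbled by plain OCR.
words = [
    (Box(0, 0, 10, 10), "Energy"),
    (Box(12, 0, 20, 10), "is"),
    (Box(22, 0, 30, 10), "E=rnc"),
    (Box(30, 0, 38, 10), "2"),
    (Box(40, 0, 60, 10), "here."),
]
elements = [Element(Box(21, 0, 39, 10), "formula", "$E = mc^2$")]

tokens, mapping = substitute(words, elements)
masked = " ".join(tokens)
final = resolve(masked, mapping)
print(masked)
print(final)
```

In this toy setup the two garbled word fragments under the formula box collapse into a single placeholder, so the text recognizer never has to read the formula region; the formula expert's result is spliced in afterwards.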