AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing

Huawei Ji, Cheng Deng, Bo Xue, Zhouyang Jin, Jiaxin Ding, Xiaoying Gan, Luoyi Fu, Xinbing Wang, Chenghu Zhou

TL;DR

This work addresses the challenge of parsing academic literature, which contains diverse structured elements such as formulas, tables, lists, and embedded math. It introduces AceParse, an open-source dataset annotated in LaTeX that covers a broad spectrum of structured content, and AceParser, a Florence-2–based multimodal model trained end-to-end on AceParse. The approach combines a large-scale data synthesis pipeline with robust boundary detection and a multimodal encoder–decoder to produce LaTeX-markup outputs, achieving state-of-the-art accuracy with improvements of about 4.1 percentage points in F1 and 5 percentage points in Jaccard Similarity over prior methods. The work provides a foundation for data-centric advancement in scholarly document processing and offers practical open-source resources for researchers to develop more versatile academic-parsing systems.

Abstract

With the development of data-centric AI, the focus has shifted from model-driven approaches to improving data quality. Academic literature, one of the crucial data types, is predominantly stored in PDF format and must be parsed into text before further processing. However, parsing the diverse structured texts in academic literature remains challenging because of the lack of datasets that cover varied text structures. In this paper, we introduce AceParse, the first comprehensive dataset designed to support the parsing of a wide range of structured texts, including formulas, tables, lists, algorithms, and sentences with embedded mathematical expressions. Based on AceParse, we fine-tuned a multimodal model, named AceParser, which accurately parses various structured texts within academic literature. This model outperforms the previous state of the art by 4.1% in F1 score and by 5% in Jaccard Similarity, demonstrating the potential of multimodal models in academic literature parsing. Our dataset is available at https://github.com/JHW5981/AceParse.
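
The abstract reports gains in F1 score and Jaccard Similarity without defining them here. The following is a minimal sketch of how the two metrics are commonly computed between a predicted and a reference LaTeX string. It assumes whitespace tokenization, a bag-of-tokens F1, and a set-based Jaccard; the function names are illustrative, and the paper's exact tokenizer and evaluation protocol may differ.

```python
from collections import Counter


def tokenize(latex: str) -> list[str]:
    # Assumption: split on whitespace; the paper's tokenizer may differ.
    return latex.split()


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (multiset overlap)."""
    pred, ref = Counter(tokenize(prediction)), Counter(tokenize(reference))
    overlap = sum((pred & ref).values())  # tokens shared, counting repeats
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def jaccard_similarity(prediction: str, reference: str) -> float:
    """Size of the shared vocabulary divided by the combined vocabulary."""
    pred, ref = set(tokenize(prediction)), set(tokenize(reference))
    if not pred and not ref:
        return 1.0  # two empty strings are treated as identical
    return len(pred & ref) / len(pred | ref)


# Hypothetical example: the prediction gets one symbol wrong.
reference = r"\frac { a } { b } + c"
prediction = r"\frac { a } { b } + d"
print(round(token_f1(prediction, reference), 4))   # 0.8889
print(jaccard_similarity(prediction, reference))   # 0.75
```

In this toy case, 8 of the 9 tokens overlap, so precision and recall are both 8/9. The set-based Jaccard is lower (6 shared of 8 distinct tokens) because repeated braces count only once.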

Paper Structure (9 sections, 1 equation, 3 figures, 3 tables)

This paper contains 9 sections, 1 equation, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: (a) The construction process of the AceParse dataset consists of three stages: document collection, data synthesis, and boundary detection. (b) The network architecture of AceParser. Visual token embeddings and text token embeddings are concatenated into multimodal token embeddings, which are then processed by a BART-based multimodal encoder-decoder.
  • Figure 2: Statistics of the AceParse dataset. (a) The number of structured items comprising AceParse, including the number of structured sentences with embedded formulas, etc. (b) Frequency histogram indexed by the label character length. (c) Joint kernel density estimation of image dimensions.
  • Figure 3: Feature map and cross-attention matrices within the AceParser model. The matrices displayed on the right represent cross-attention from four layers of AceParser, with the top row showing attention before training and the bottom row after training. Structured text locations are highlighted with yellow ellipses.
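
The caption of Figure 1(b) says that visual and text token embeddings are concatenated into one multimodal sequence before the encoder-decoder. The sketch below illustrates that step with plain Python lists. It assumes that both streams share one embedding width and that visual tokens come first; the caption does not state the order, and the helper name and toy sizes are hypothetical, not taken from the AceParser code.

```python
Vector = list[float]


def concat_multimodal(visual_tokens: list[Vector], text_tokens: list[Vector]) -> list[Vector]:
    """Join two token sequences along the sequence axis.

    Assumption: visual tokens come first; the caption does not state the order.
    """
    widths = {len(vec) for vec in visual_tokens + text_tokens}
    if len(widths) > 1:
        raise ValueError(f"embedding widths differ: {sorted(widths)}")
    return visual_tokens + text_tokens


# Hypothetical toy sizes: 3 visual tokens and 2 text tokens, width 4.
visual = [[0.1] * 4 for _ in range(3)]
text = [[0.2] * 4 for _ in range(2)]
multimodal = concat_multimodal(visual, text)
print(len(multimodal), len(multimodal[0]))  # sequence length 5, width 4
```

Concatenating along the sequence axis leaves the embedding width unchanged, so a single encoder can attend over image and text positions together.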