Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown

Changxu Duan

TL;DR

This work addresses the inefficiency of end-to-end PDF-to-Markdown transformers by introducing EditTrans, a hybrid editing-generation framework that first identifies copyable text via a layout-aware Editable Text Identification module, then builds an edit queue to guide targeted Fill-in-the-Middle (FIM) generation. By combining a lightweight ERNIE-Layout-based classifier, an edit-queue mechanism, and FIM-powered generation, EditTrans reduces transformation latency while preserving or improving output quality across multiple backbone models and domains. The authors validate their approach on a large arXiv-derived dataset and demonstrate latency savings of up to 44.5% with minimal quality loss, releasing code and dataset-building scripts for reproducibility. This approach enables scalable, accessible, and machine-actionable scholarly content, aligning with FAIR principles and offering a modular path for future enhancements such as figure embedding and advanced layout guidance.

Abstract

Academic documents stored in PDF format can be transformed into plain-text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible for diverse uses, such as linguistic corpus compilation. Such documents, typically delivered in PDF format, contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely laid-out text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models are significantly inefficient: their token-by-token decoding from scratch wastes many inference steps regenerating dense text that could be copied directly from the PDF files. To solve this problem, we introduce EditTrans, a hybrid editing-generation model that identifies a queue of to-be-edited text from a PDF before it starts generating markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced transformation latency by up to 44.5% compared to end-to-end decoder transformer models, while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.

Paper Structure

This paper contains 34 sections, 4 figures, 3 tables, 1 algorithm.

Figures (4)

  • Figure 1: The EditTrans workflow. The Editable Text Identification module detects whether each span is copyable. The Edit Queue Building module builds an edit queue, in which a marker initiates generation. The Edit Actions Execution module executes the edits: the blue part is generated by a backbone VLM with the FIM paradigm.
  • Figure 2: Execution time overhead of each module, where ET stands for EditTrans. The Generation Step in w/ ET is the Fill Markdown In the Middle step described in Section \ref{sec:eae}; in w/o ET it is the model's inference step. The EditTrans modules add a small amount of extra latency ($\sim$0.04 seconds), which is insignificant compared to the time they save in the Generation Step.
  • Figure 3: An edge case encountered while constructing the edit action queue. The green boxes represent text spans, and the numbers indicate the reading order predicted by PyMuPDF4LLM. In this example, the text columns on the left should have been read from top to bottom, followed by the columns on the right, but the predicted reading order is disrupted, which makes EditTrans ineffective in this scenario.
  • Figure 4: Example of EditTrans processing inline formulas. Step 1 detects whether each span is copyable. Step 2 builds an edit queue, in which one marker initiates generation and another serves as a stop signal. Step 3 executes the edits: the green part is copied, while the yellow part is generated by a backbone model with the Fill in the Middle paradigm.
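
The three steps described in the figure captions (identify copyable spans, build an edit queue, execute the edits) can be illustrated with a toy sketch. This is not the authors' implementation: the `Span` class, the `COPY`/`GENERATE` action names, and the `generate(prefix, suffix, source)` callback are hypothetical stand-ins, and the lambda at the end merely imitates a backbone model filling in the middle between copied text.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Span:
    text: str
    copyable: bool  # hypothetical label produced by the identification step


def build_edit_queue(spans: List[Span]) -> List[Tuple[str, str]]:
    """Merge adjacent spans with the same label into COPY / GENERATE actions."""
    queue: List[Tuple[str, str]] = []
    for span in spans:
        action = "COPY" if span.copyable else "GENERATE"
        if queue and queue[-1][0] == action:
            # Extend the previous action instead of starting a new one.
            queue[-1] = (action, queue[-1][1] + " " + span.text)
        else:
            queue.append((action, span.text))
    return queue


def execute_edits(
    queue: List[Tuple[str, str]],
    generate: Callable[[str, str, str], str],
) -> str:
    """Copy COPY actions verbatim; delegate GENERATE actions to `generate`."""
    parts: List[str] = []
    for i, (action, text) in enumerate(queue):
        if action == "COPY":
            parts.append(text)
        else:
            prefix = " ".join(parts)
            # After merging, the action following a GENERATE is always a COPY.
            suffix = queue[i + 1][1] if i + 1 < len(queue) else ""
            parts.append(generate(prefix, suffix, text))
    return " ".join(parts)


# Toy input: an inline formula sits between two copyable text spans.
spans = [
    Span("The loss is", True),
    Span("L = x2", False),
    Span("for all x.", True),
]
queue = build_edit_queue(spans)
# Stand-in for a FIM-capable model: it only wraps the source text in "$".
markdown = execute_edits(queue, lambda prefix, suffix, source: f"${source}$")
print(markdown)
```

In this sketch only the middle span reaches the generator, which mirrors the source of the latency savings described in the abstract: dense text is copied rather than regenerated token by token.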