Layout-Aware Text Editing for Efficient Transformation of Academic PDFs to Markdown
Changxu Duan
TL;DR
This work addresses the inefficiency of end-to-end PDF-to-Markdown transformers by introducing EditTrans, a hybrid editing-generation framework that first identifies copyable text via a layout-aware Editable Text Identification module, then builds an edit queue to guide targeted Fill-in-the-Middle (FIM) generation. By combining a lightweight ERNIE-Layout-based classifier, an edit-queue mechanism, and FIM-powered generation, EditTrans reduces transformation latency while preserving or improving output quality across multiple backbone models and domains. The authors validate their approach on a large arXiv-derived dataset and demonstrate substantial latency savings (up to 44.5%) with minimal quality loss, releasing code and dataset-building scripts for reproducibility. This approach enables scalable, accessible, and machine-actionable scholarly content, aligning with FAIR principles and offering a modular path for future enhancements such as figure embedding and advanced layout guidance.
Abstract
Academic documents stored in PDF format can be transformed into plain-text structured markup languages to enhance accessibility and enable scalable digital library workflows. Markup languages allow for easier updates and customization, making academic content more adaptable and accessible to diverse uses, such as linguistic corpus compilation. These documents contain complex elements including mathematical formulas, figures, headers, and tables, as well as densely laid-out text. Existing end-to-end decoder transformer models can transform screenshots of documents into markup language. However, these models are inefficient: their token-by-token decoding from scratch wastes many inference steps regenerating dense text that could be copied directly from the PDF file. To solve this problem, we introduce EditTrans, a hybrid editing-generation model that identifies a queue of to-be-edited text from a PDF before it starts generating markup language. EditTrans contains a lightweight classifier fine-tuned from a Document Layout Analysis model on 162,127 pages of documents from arXiv. In our evaluations, EditTrans reduced transformation latency by up to 44.5% compared to end-to-end decoder transformer models while maintaining transformation quality. Our code and reproducible dataset production scripts are open-sourced.
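The copy-versus-generate idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the label names (`KEEP`, `DELETE`, `INSERT_BEFORE`), the `Edit` record, and the `fill` callback standing in for a Fill-in-the-Middle model are all hypothetical, chosen only to show how text labeled as copyable can bypass the decoder while generation is limited to the gaps.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical labels; the classifier in the paper may use a different label set.
KEEP, DELETE, INSERT_BEFORE = "keep", "delete", "insert_before"


@dataclass
class Edit:
    kind: str       # "copy" or "generate"
    text: str = ""  # copied text for "copy"; unused for "generate"


def build_edit_queue(tokens: List[Tuple[str, str]]) -> List[Edit]:
    """Turn (text, label) pairs extracted from a PDF into copy/generate steps."""
    queue: List[Edit] = []
    for text, label in tokens:
        if label == DELETE:
            continue  # e.g. a page number that should not reach the output
        if label == INSERT_BEFORE:
            queue.append(Edit("generate"))  # markup is missing before this token
        # Merge consecutive copied tokens into a single run.
        if queue and queue[-1].kind == "copy":
            queue[-1].text += " " + text
        else:
            queue.append(Edit("copy", text))
    return queue


def apply_edit_queue(queue: List[Edit], fill: Callable[[str, str], str]) -> str:
    """Copy runs verbatim; call `fill(prefix, suffix)` only where a gap exists."""
    out: List[str] = []
    for i, edit in enumerate(queue):
        if edit.kind == "copy":
            out.append(edit.text)
        else:
            prefix = " ".join(out)
            suffix = " ".join(e.text for e in queue[i + 1:] if e.kind == "copy")
            out.append(fill(prefix, suffix))
    return " ".join(out)


if __name__ == "__main__":
    # Toy input: a page number to drop and a heading that needs markup before it.
    tokens = [("12", DELETE), ("Introduction", INSERT_BEFORE),
              ("Academic", KEEP), ("PDFs", KEEP)]
    # Stub generator standing in for a real FIM model call.
    print(apply_edit_queue(build_edit_queue(tokens), lambda prefix, suffix: "#"))
```

In this toy case the stub generator is called once, for the missing heading marker, and the remaining words are copied without any decoding steps. This is the intuition behind the latency reduction, although the actual savings depend on the model and the document.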
