A Comparative Study of PDF Parsing Tools Across Diverse Document Categories

Narayan S. Adhikari, Shradha Agarwal

TL;DR

The paper addresses cross-domain PDF parsing by comparing 10 open-source tools across six DocLayNet document categories, focusing on text extraction and table detection. It covers both rule-based parsers and learning-based approaches (notably Nougat and Table Transformer, TATR) and evaluates them against a DocLayNet-derived ground truth using several metrics: Levenshtein-based F1, BLEU-4, and local alignment for text, plus IoU/Jaccard for tables. PyMuPDF and pypdfium2 perform best for general text extraction, while Scientific and Patent documents benefit from learning-based methods. For tables, TATR offers strong cross-domain recall, with Camelot and Tabula performing best in specific categories. The study gives practical guidance on choosing a tool by document type and points to hybrid approaches and better handling of complex scientific and tabular content as directions for future work.

Abstract

PDF is one of the most prominent data formats, making PDF parsing crucial for information extraction and retrieval, particularly with the rise of RAG systems. While various PDF parsing tools exist, their effectiveness across different document types remains understudied, especially beyond academic papers. Our research aims to address this gap by comparing 10 popular PDF parsing tools across 6 document categories using the DocLayNet dataset. These tools include PyPDF, pdfminer-six, PyMuPDF, pdfplumber, pypdfium2, Unstructured, Tabula, and Camelot, as well as the deep learning-based tools Nougat and Table Transformer (TATR). We evaluated both text extraction and table detection capabilities. For text extraction, PyMuPDF and pypdfium2 generally outperformed the others, but all parsers struggled with Scientific and Patent documents. For these challenging categories, learning-based tools like Nougat demonstrated superior performance. In table detection, TATR excelled in the Financial, Patent, Law & Regulations, and Scientific categories. The table detection tool Camelot performed best for Tender documents, while PyMuPDF performed best in the Manual category. Our findings highlight the importance of selecting appropriate parsing tools based on document type and specific tasks, providing valuable insights for researchers and practitioners working with diverse document sources.

Paper Structure

This paper contains 13 sections, 9 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Distribution of document categories in the DocLayNet dataset (Pfitzmann et al., 2022)
  • Figure 2: Example of ground truth generation from the JSON file loaded into a dataframe. Content from the 'text' column was extracted, and new lines and spaces were added according to the 'category' column.
  • Figure 3: Comparison of PDF parser outputs against ground truth data. Both JSON and PDF files are processed to produce ground truth text and extracted text from PDF parsers. Both outputs are saved in the tokenized and combined format before being evaluated using metrics such as F1 score, BLEU, and Local Alignment.
  • Figure 4: The similarity matrix is generated by calculating the normalized Levenshtein similarity between the tokenized ground truth (GT) and extracted text (ET). If a value is greater than the threshold (colored cells), it is counted as 1. Here $\sum_{i=1}^{3}\sum_{j=1}^{3} TP_{i,j} = 2$
  • Figure 5: An example of local alignment score calculation for two strings. First, we define the match score, mismatch penalty, and gap penalty. For these two strings, the local alignment score is 4 and the normalized local alignment score is 0.67.
  • ...and 6 more figures
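
The computations described in the Figure 4 and Figure 5 captions can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation. The function names are hypothetical. The Levenshtein normalization (dividing by the longer token's length), the 0.7 threshold, and the scoring values (match +1, mismatch -1, gap -1) are all assumed, since the captions do not specify them. The normalization of the local alignment score is not given in the captions either, so the sketch stops at the raw score.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer token."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def count_matches(gt_tokens: List[str], et_tokens: List[str],
                  threshold: float = 0.7) -> int:
    """Build the GT x ET similarity matrix and count cells above the threshold."""
    matrix = [[similarity(g, e) for e in et_tokens] for g in gt_tokens]
    return sum(1 for row in matrix for s in row if s > threshold)


def local_alignment_score(a: str, b: str, match: int = 1,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Smith-Waterman style local alignment score with a linear gap penalty."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Scores are floored at 0 so an alignment can restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[-1])
        prev = curr
    return best
```

In this sketch, `count_matches` plays the role of the summed $TP_{i,j}$ term in Figure 4. Precision and recall would then follow from dividing that count by the number of extracted and ground-truth tokens respectively, though the exact definitions used in the paper are not shown in this section.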