Optimizing Nepali PDF Extraction: A Comparative Study of Parser and OCR Technologies

Prabin Paudel, Supriya Khadka, Ranju G. C., Rahul Shah

TL;DR

The study addresses the challenge of extracting Nepali text from PDFs by comparing PDF parsing and OCR approaches. It evaluates fast, code-driven parsing against OCR methods (PyTesseract and EasyOCR) across PDFs with Unicode and non-Unicode Nepali fonts, as well as image-embedded content. Findings show that parsers are faster but sensitive to font encoding and image content, while OCR offers more consistent accuracy at the expense of speed; PyTesseract provides the best overall speed-accuracy trade-off for Nepali PDFs. The work offers practical guidance for selecting robust text extraction pipelines in low-resource language contexts and informs workflow design for Nepali document processing.
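
The parser's sensitivity to font encoding suggests a simple routing check: if the parser's output contains few Devanagari code points, the PDF probably uses a non-Unicode font or image content, and OCR is the safer path. The sketch below is illustrative only and is not the authors' pipeline. The helper names, the 0.5 threshold, and the `parse` and `ocr` callables (stand-ins for a real PDF parser and an OCR wrapper such as PyTesseract) are all assumptions.

```python
from typing import Callable, Tuple

# Devanagari Unicode block, which covers Nepali letters, vowel signs, and digits.
DEVANAGARI_START, DEVANAGARI_END = 0x0900, 0x097F


def devanagari_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(DEVANAGARI_START <= ord(c) <= DEVANAGARI_END for c in chars)
    return hits / len(chars)


def extract_text(
    pdf_path: str,
    parse: Callable[[str], str],  # hypothetical wrapper around a PDF parser
    ocr: Callable[[str], str],    # hypothetical wrapper around an OCR engine
    min_ratio: float = 0.5,       # assumed threshold, tune on real documents
) -> Tuple[str, str]:
    """Try the fast parser first; fall back to OCR if the output is not Devanagari."""
    text = parse(pdf_path)
    if devanagari_ratio(text) >= min_ratio:
        return text, "parser"
    return ocr(pdf_path), "ocr"
```

A legacy-font PDF typically yields ASCII-looking output from a parser, so its ratio falls near zero and the call is routed to OCR. Documents that mix Nepali with a lot of English or ASCII digits would need a lower threshold.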

Abstract

This research compares PDF parsing and Optical Character Recognition (OCR) methods for extracting Nepali content from PDFs. PDF parsing offers fast and accurate extraction but struggles with non-Unicode Nepali fonts. OCR, specifically PyTesseract, overcomes these challenges and works on both digital and scanned PDFs. The study finds that PDF parsers are faster, but their accuracy varies with the type of PDF. OCR engines, PyTesseract in particular, deliver consistent accuracy at the cost of slightly longer extraction times. Given the project's focus on Nepali PDFs, PyTesseract emerges as the most suitable library, balancing extraction speed and accuracy.
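
A comparison of this kind needs a wall-clock time and an accuracy score per extractor. The harness below is a minimal sketch of how one might collect both. It is not the evaluation procedure from the paper: the character-level similarity from `difflib` is an assumed stand-in for whatever accuracy measure the authors used, and the extractor callables are hypothetical wrappers around real libraries.

```python
import time
from difflib import SequenceMatcher
from typing import Callable, Dict


def char_similarity(extracted: str, reference: str) -> float:
    """Character-level similarity in [0, 1]; an assumed proxy for accuracy."""
    return SequenceMatcher(None, extracted, reference, autojunk=False).ratio()


def benchmark(
    extractors: Dict[str, Callable[[str], str]],  # name -> hypothetical extractor
    pdf_path: str,
    reference: str,  # hand-verified ground-truth text for this PDF
) -> Dict[str, Dict[str, float]]:
    """Run each extractor once and record elapsed seconds and similarity."""
    results = {}
    for name, extract in extractors.items():
        start = time.perf_counter()
        text = extract(pdf_path)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "similarity": char_similarity(text, reference),
        }
    return results
```

Running this over separate sets of Unicode, non-Unicode, and image-embedded PDFs, and averaging per set, would show how each method's accuracy changes with PDF type.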

Paper Structure (8 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Extraction of Unicode Incompatible Font
  • Figure 2: Comparison graph of different extraction methods