Reference-Based Post-OCR Processing with LLM for Precise Diacritic Text in Historical Document Recognition

Thao Do, Dinh Phu Tran, An Vo, Daeyoung Kim

TL;DR

The paper tackles robust OCR for historical documents in diacritic languages, focusing on Vietnamese. It introduces a reference-based post-OCR correction pipeline that uses content-focused ebooks and large language models to generate high-precision pseudo ground truth without extra annotation. It also introduces VieBookRead, a large dataset of classical Vietnamese books. The generated labels are graded higher than the output of a transformer-based Vietnamese spell correction model (mean score 8.72 vs. 7.03 on a 10-point scale), and a baseline OCR model trained on the dataset compares well with widely used open-source engines. The work enables better page-to-page datasets and has practical value for preserving and analyzing historical texts in diacritic languages.

Abstract

Extracting fine-grained OCR text from aged documents in diacritic languages remains challenging due to unexpected artifacts, time-induced degradation, and lack of datasets. While standalone spell correction approaches have been proposed, they show limited performance for historical documents due to numerous possible OCR error combinations and differences between modern and classical corpus distributions. We propose a method utilizing available content-focused ebooks as a reference base to correct imperfect OCR-generated text, supported by large language models. This technique generates high-precision pseudo-page-to-page labels for diacritic languages, where small strokes pose significant challenges in historical conditions. The pipeline eliminates various types of noise from aged documents and addresses issues such as missing characters, words, and disordered sequences. Our post-processing method, which generated a large OCR dataset of classical Vietnamese books, achieved a mean grading score of 8.72 on a 10-point scale. This outperformed the state-of-the-art transformer-based Vietnamese spell correction model, which scored 7.03 when evaluated on a sampled subset of the dataset. We also trained a baseline OCR model to assess and compare it with well-known engines. Experimental results demonstrate the strength of our baseline model compared to widely used open-source solutions. The resulting dataset will be released publicly to support future studies.
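To make the reference-based idea in the abstract concrete, the sketch below shows one simple way a noisy OCR line could be matched against clean reference text. It is an illustration under stated assumptions, not the authors' pipeline: the paper's method relies on large language models, whereas this example swaps in plain fuzzy string matching from Python's standard library. The function names `strip_diacritics` and `best_reference_span` and the `slack` parameter are hypothetical and chosen only for this example.

```python
import difflib
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove diacritics so noisy OCR text and clean text can be compared."""
    # "đ"/"Đ" do not decompose under NFD, so map them explicitly.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def best_reference_span(ocr_line: str, reference: str, slack: int = 2):
    """Return (span, score) for the reference word window closest to ocr_line.

    Hypothetical helper: slides word windows over the reference text and
    scores each one against the OCR line with diacritics removed on both sides.
    """
    ref_words = reference.split()
    n = len(ocr_line.split())
    target = strip_diacritics(ocr_line).lower()
    best_span, best_score = "", 0.0
    # Allow the window to be a few words shorter or longer than the OCR line,
    # since OCR may drop or split words.
    for size in range(max(1, n - slack), n + slack + 1):
        for start in range(0, len(ref_words) - size + 1):
            candidate = " ".join(ref_words[start:start + size])
            score = difflib.SequenceMatcher(
                None, target, strip_diacritics(candidate).lower()
            ).ratio()
            if score > best_score:
                best_span, best_score = candidate, score
    return best_span, best_score


if __name__ == "__main__":
    # Reference text with full diacritics (opening line of a classical poem).
    reference = "Trăm năm trong cõi người ta chữ tài chữ mệnh khéo là ghét nhau"
    # OCR output that lost its diacritics and one character.
    span, score = best_reference_span("Tram nam trong coi ngui ta", reference)
    print(span, round(score, 3))
```

Comparing with diacritics removed reflects the abstract's point that small strokes are the part of the text OCR most often gets wrong, so the base letters carry the matching signal and the reference supplies the diacritics. A brute-force window search like this would be too slow for whole books; it is meant only to show the matching step on a single line.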

Paper Structure

This paper contains 21 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: TF-IDF differences between a modern corpus and the content of our classical book set
  • Figure 2: t-SNE visualization of our dataset vs. ViTextVQA, NomNaOCR, and Vi-BCI, using features from the CLIP ViT-L/14@336px image encoder
  • Figure 3: Our main pipeline to generate precise diacritic text for book images
  • Figure 4: A sample OCR extraction by Azure DI: (a) the original page, (b) page visualized with paragraphs, (c) OCR extracted text in JSON form
  • Figure 5: Noise Suppression by Heuristic Filters
  • ...and 2 more figures