Enhancing OCR for Sino-Vietnamese Language Processing via Fine-tuned PaddleOCRv5

Minh Hoang Nguyen, Su Nguyen Thiet

TL;DR

The paper tackles OCR for Sino–Vietnamese Han–Nom texts found in degraded historical Vietnamese manuscripts, addressing a gap where existing models struggle with noise, non-standard glyphs, and handwriting variance. It introduces a fine-tuning pipeline for PaddleOCRv5_rec using a teacher–student distillation setup (GTC-NRTR as teacher, SVTR-HGNet as student) and a full training workflow with preprocessing, LMDB conversion, evaluation, and visualization. Using a sizable synthetic dataset subset (approx. 400k training and 100k evaluation samples) derived from a large Chinese document OCR corpus, the approach yields substantial gains: exact accuracy climbs from 37.5% to 50.0% and average confidence from 81.3% to 91.1%. An interactive demo on HuggingFace Spaces demonstrates practical benefits for downstream tasks such as Han–Vietnamese semantic alignment, translation, and historical linguistics research, with future work proposed on data expansion, two-stage detection, semantic extraction, and cross-model comparisons.

Abstract

Recognizing and processing Classical Chinese (Han-Nom) texts plays a vital role in digitizing Vietnamese historical documents and enabling cross-lingual semantic research. However, existing OCR systems struggle with degraded scans, non-standard glyphs, and handwriting variations common in ancient sources. In this work, we propose a fine-tuning approach for PaddleOCRv5 to improve character recognition on Han-Nom texts. We retrain the text recognition module using a curated subset of ancient Vietnamese Chinese manuscripts, supported by a full training pipeline covering preprocessing, LMDB conversion, evaluation, and visualization. Experimental results show a significant improvement over the base model, with exact accuracy increasing from 37.5 percent to 50.0 percent, particularly under noisy image conditions. Furthermore, we develop an interactive demo that visually compares pre- and post-fine-tuning recognition results, facilitating downstream applications such as Han-Vietnamese semantic alignment, machine translation, and historical linguistics research. The demo is available at https://huggingface.co/spaces/MinhDS/Fine-tuned-PaddleOCRv5
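
The abstract reports exact accuracy and the TL;DR also reports average confidence, but neither metric is defined on this page. The following is a minimal sketch of how such numbers could be computed, assuming that "exact accuracy" means the fraction of text lines whose predicted string matches the reference string exactly and that "average confidence" is the mean of per-line recognition scores. The function name `exact_accuracy_and_confidence` and the toy data are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def exact_accuracy_and_confidence(
    predictions: Sequence[Tuple[str, float]],
    references: Sequence[str],
) -> Tuple[float, float]:
    """Return (exact-match accuracy, mean confidence) over text lines.

    Assumption: a prediction counts as correct only if the whole
    predicted string equals the reference string.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one sample is required")

    # Count lines whose predicted text matches the reference exactly.
    exact_hits = sum(
        1 for (text, _), ref in zip(predictions, references) if text == ref
    )
    # Average the recognizer's per-line confidence scores.
    mean_confidence = sum(conf for _, conf in predictions) / len(predictions)
    return exact_hits / len(references), mean_confidence


# Hypothetical toy data: four text lines, two recognized exactly.
references = ["天地玄黃", "宇宙洪荒", "日月盈昃", "辰宿列張"]
predictions = [
    ("天地玄黃", 0.75),
    ("宇宙洪荒", 1.0),
    ("日月盈仄", 0.5),   # one wrong character, so not an exact match
    ("辰宿列長", 0.25),  # one wrong character, so not an exact match
]

accuracy, mean_confidence = exact_accuracy_and_confidence(predictions, references)
print(f"exact accuracy: {accuracy:.1%}, average confidence: {mean_confidence:.1%}")
```

Under this definition a single wrong character makes the whole line count as an error, which is one plausible reason exact accuracy can sit well below average confidence, as in the figures reported above.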

Paper Structure

This paper contains 20 sections, 1 equation, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Examples of OCR errors in ancient Han–Nom texts: noisy characters, missing lines, and sentence merging.
  • Figure 2: An example of an alignment error caused by incorrect character mapping due to OCR mistakes.
  • Figure 3: Fine-tuning pipeline of PaddleOCRv5 for character recognition on ancient Han–Nom documents.
  • Figure 4: Inference pipeline of the fine-tuned PaddleOCRv5 model for ancient Han–Nom documents.
  • Figure 5: Examples of preprocessed Han–Nom text-line samples used for fine-tuning.
  • ...and 3 more figures