JaPOC: Japanese Post-OCR Correction Benchmark using Vouchers

Masato Fujitake

TL;DR

JaPOC introduces a public benchmark for post-OCR correction of Japanese vouchers, focusing on company-name errors caused by seal noise. The study evaluates OCR outputs from two services and tests end-to-end correction with T5-based sequence-to-sequence models (Megagon T5ja and Retrieva T5ja) and with a rule-based baseline built on Levenshtein distance. Results show that a suitably chosen pre-trained language model, fine-tuned on JaPOC data, substantially improves accuracy (average gains of up to ~9–10 points, e.g., from ~85.4% to ~94.8%), while the rule-based method offers a model-free alternative. The benchmark and baselines give a practical starting point for deploying post-OCR correction in automated processing of corporate vouchers; the achievable performance depends on the quality of the underlying OCR and on its error distribution.
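
To make the idea of a Levenshtein-distance rule-based correction concrete, here is a minimal sketch in Python. It is not the paper's implementation: the summary above only states that the baseline uses Levenshtein distance, so the list of known company names, the `correct_name` helper, and the `max_distance` threshold are all illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions cost 1)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def correct_name(ocr_text: str, known_names: list[str], max_distance: int = 2) -> str:
    """Hypothetical helper: snap an OCR result to the nearest known name.

    Assumption: a list of valid company names is available. If no name is
    within max_distance edits, the OCR text is returned unchanged.
    """
    if not known_names:
        return ocr_text
    best = min(known_names, key=lambda name: levenshtein(ocr_text, name))
    return best if levenshtein(ocr_text, best) <= max_distance else ocr_text


# Hypothetical example: the OCR misread 社 as 杜.
names = ["株式会社サンプル", "有限会社テスト"]
corrected = correct_name("株式会杜サンプル", names)
```

A distance threshold of this kind keeps the sketch from overwriting strings that do not resemble any known name; whether the paper's baseline uses a threshold at all is not stated in this summary.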

Abstract

In this paper, we create benchmarks and assess the effectiveness of error correction methods for Japanese vouchers in OCR (Optical Character Recognition) systems. Automated processing depends on correctly recognizing scanned voucher text, such as the company name on invoices. However, perfect recognition is difficult because of noise such as stamps, so it is crucial to correct erroneous OCR results. No publicly available OCR error correction benchmarks exist for Japanese, and correction methods have not been adequately researched. In this study, we measured the text recognition accuracy of existing services on Japanese vouchers and developed a post-OCR correction benchmark. We then proposed simple baselines for error correction using language models and verified whether the proposed method could effectively correct these errors. In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.

Paper Structure (14 sections, 1 figure, 5 tables)

This paper contains 14 sections, 1 figure, 5 tables.

Figures (1)

  • Figure 1: Examples of text recognition images from Japanese vouchers. Because seals are conventionally stamped on vouchers, text images from vouchers tend to be difficult for OCR to read.