JaPOC: Japanese Post-OCR Correction Benchmark using Vouchers
Masato Fujitake
TL;DR
JaPOC introduces a public benchmark for post-OCR correction of Japanese vouchers, focusing on company-name errors caused by seal noise. The study evaluates OCR outputs from two services and tests end-to-end correction with T5-based sequence-to-sequence models (Megagon T5ja and Retrieva T5ja) and with a rule-based baseline built on Levenshtein distance. Results show that a well-chosen pre-trained language model fine-tuned on JaPOC data substantially improves accuracy (average gains of up to roughly 9–10 points, e.g., from ~85.4% to ~94.8%), while the rule-based method provides a solid, model-free alternative. The benchmark and baselines offer a practical path toward robust post-OCR correction in automated processing of corporate vouchers, though performance depends on the quality of the underlying OCR and on its error distribution.
Abstract
In this paper, we create a benchmark and assess the effectiveness of error correction methods for OCR (Optical Character Recognition) output on Japanese vouchers. Automated processing requires that scanned voucher text, such as the company name on an invoice, be recognized correctly. However, perfect recognition is difficult because of noise such as stamps, so it is crucial to rectify erroneous OCR results. No publicly available OCR error correction benchmark exists for Japanese, and correction methods have not been adequately studied. In this study, we measure the text recognition accuracy of existing services on Japanese vouchers and develop a post-OCR correction benchmark. We then propose simple baselines for error correction using language models and verify whether they can effectively correct these errors. In the experiments, the proposed error correction algorithm significantly improved overall recognition accuracy.
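As a rough illustration of the kind of Levenshtein-distance rule-based baseline mentioned in the TL;DR, the sketch below snaps an OCR string to the nearest entry in a list of known company names and scores the result by exact match. The text above does not specify how the baseline applies the distance, so the lexicon lookup, the `max_distance` threshold, the exact-match metric, and all function and company names here are assumptions made for illustration only.

```python
from typing import Iterable, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def correct(ocr_text: str, lexicon: Iterable[str], max_distance: int = 2) -> str:
    """Return the closest lexicon entry, or the input if nothing is close enough.

    Assumption: a lexicon of valid company names is available, and
    max_distance is a hypothetical tuning parameter.
    """
    candidates = list(lexicon)
    if not candidates:
        return ocr_text
    best = min(candidates, key=lambda name: levenshtein(ocr_text, name))
    return best if levenshtein(ocr_text, best) <= max_distance else ocr_text


def exact_match_accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that equal the reference string exactly."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


# Hypothetical company names; the second OCR string contains one wrong character.
lexicon = ["株式会社サンプル", "有限会社テスト"]
ocr_outputs = ["株式会社サンプル", "株式会杜サンプル"]
golds = ["株式会社サンプル", "株式会社サンプル"]
corrected = [correct(text, lexicon) for text in ocr_outputs]
```

In this toy run, `exact_match_accuracy(ocr_outputs, golds)` and `exact_match_accuracy(corrected, golds)` give the before and after scores. A lookup of this sort can only recover names already in the lexicon, which is one reason a learned sequence-to-sequence corrector may generalize better.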
