DKDS: A Benchmark Dataset of Degraded Kuzushiji Documents with Seals for Detection and Binarization

Rui-Yang Ju, Kohei Yamashita, Hirotaka Kameko, Shinsuke Mori

TL;DR

This work introduces DKDS, the first public benchmark for degraded Kuzushiji documents with seals, targeting two tasks: text/seal detection and document binarization. The dataset is built from Genji Monogatari with added historical seals and expert-verified binarization ground-truth, and includes training/testing splits with easy and difficult test conditions. Baselines span traditional image processing, YOLO-based detection, and GAN-based binarization (including a cGAN), and an OCR evaluation shows that binarization substantially improves recognition. The results highlight how difficult it is to handle Kuzushiji characters that overlap with seals, and the dataset supports the development of robust detection, binarization, and downstream OCR methods for degraded pre-modern Japanese documents.

Abstract

Kuzushiji, a pre-modern Japanese cursive script, can currently be read and understood by only a few thousand trained experts in Japan. With the rapid development of deep learning, researchers have begun applying Optical Character Recognition (OCR) techniques to transcribe Kuzushiji into modern Japanese. Although existing OCR methods perform well on clean pre-modern Japanese documents written in Kuzushiji, they often fail to consider various types of noise, such as document degradation and seals, which significantly affect recognition accuracy. To the best of our knowledge, no existing dataset specifically addresses these challenges. To address this gap, we introduce the Degraded Kuzushiji Documents with Seals (DKDS) dataset as a new benchmark for related tasks. We describe the dataset construction process, which required the assistance of a trained Kuzushiji expert, and define two benchmark tracks: (1) text and seal detection and (2) document binarization. For the text and seal detection track, we provide baseline results using several recent versions of the You Only Look Once (YOLO) models for detecting Kuzushiji characters and seals. For the document binarization track, we present baseline results from traditional binarization algorithms, traditional algorithms combined with K-means clustering, two state-of-the-art (SOTA) Generative Adversarial Network (GAN) methods, as well as our Conditional GAN (cGAN) baseline. The DKDS dataset and the implementation code for baseline methods are available at https://ruiyangju.github.io/DKDS.
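To make the "traditional binarization algorithms" mentioned in the abstract more concrete, the following is a minimal sketch of one widely used classical method, Otsu's global thresholding. It is an illustration only: the abstract does not say which traditional algorithms the authors benchmark, and the function names `otsu_threshold` and `binarize`, as well as the toy image, are assumptions made for this example rather than part of the DKDS code release.

```python
from typing import List, Sequence


def otsu_threshold(pixels: Sequence[int]) -> int:
    """Return the grey level t (0..255) that maximises between-class variance.

    Pixels with value <= t form the dark class (ink); the rest form the
    light class (paper). Illustrative helper, not the authors' baseline.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of grey levels in the dark class so far
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # dark class is still empty
        w1 = total - w0
        if w1 == 0:
            break     # light class is empty; no further split is possible
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance, up to a constant factor of 1 / total**2.
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image: List[List[int]]) -> List[List[int]]:
    """Map a greyscale image (rows of 0..255 ints) to 0 (ink) / 255 (paper)."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    return [[0 if p <= t else 255 for p in row] for row in image]


# Toy 2x3 "page": dark strokes (20-30) on light paper (200-220).
toy_image = [[200, 220, 30],
             [20, 210, 25]]
toy_binary = binarize(toy_image)
```

A single global threshold of this kind is likely to struggle when a coloured seal overlaps the ink, since seal pixels can fall on either side of the threshold; this is the kind of case the binarization track is designed to probe.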

Paper Structure

This paper contains 25 sections, 1 equation, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Comparison of Optical Character Recognition (OCR) results on Kuzushiji characters overlapping with seals between the original and binarized images. OCR was conducted using the "miwo" app [clanuwat2021miwo]. From left to right, the observed OCR errors include recognition of an extra character, recognition of an incorrect character, and misclassification of seal inscriptions as text.
  • Figure 2: The DKDS dataset is the first collection of degraded pre-modern Japanese document images specifically designed to address the challenge of Kuzushiji characters overlapping with seals. Based on the dataset, we define two benchmark tracks: (1) Text and Seal Detection, and (2) Document Binarization.
  • Figure 3: Examples of raw Kuzushiji document data from the book Genji Monogatari (The Tale of Genji) [genjimonogatari].
  • Figure 4: Examples of our collected imperial seals from the Qing dynasty, with backgrounds removed, which are used to simulate seal interference.
  • Figure 5: The overall workflow of the proposed DKDS dataset construction includes raw data collection, text and seal detection annotations, initial binarization ground-truth generation, verification, and manual correction. The initial binarization ground-truth was generated using a pre-trained binarization model trained on the DIBCO benchmarks, while the verification was conducted by a trained Kuzushiji expert.
  • ...and 6 more figures