Oracle-MNIST: a Dataset of Oracle Characters for Benchmarking Machine Learning Algorithms

Mei Wang; Weihong Deng

TL;DR

The paper addresses the need for more realistic benchmarks beyond MNIST by introducing Oracle-MNIST, a dataset of 30,222 ancient oracle-bone characters across 10 classes, stored as 28 by 28 grayscale images. The data use the MNIST file format and come with a straightforward conversion pipeline, so they integrate easily into existing ML workflows; the split is 27,222 training images and 3,000 test images (300 per class). Benchmark results show that classical ML methods perform worse on Oracle-MNIST than on MNIST and Fashion-MNIST, while CNNs reduce the error to about 6.2%, which leaves ample room for improvement and reflects the difficulty posed by noise and stylistic variance. Overall, Oracle-MNIST offers a practical, hard benchmark for assessing robustness to degradation in historical-script recognition and can be readily adopted within standard ML tooling.

Abstract

We introduce the Oracle-MNIST dataset, comprising 28$\times$28 grayscale images of 30,222 ancient characters from 10 categories, for benchmarking pattern classification, with particular challenges in image noise and distortion. The training set consists of 27,222 images in total, and the test set contains 300 images per class. Oracle-MNIST shares the same data format as the original MNIST dataset, allowing for direct compatibility with all existing classifiers and systems, but it constitutes a more challenging classification task than MNIST. The images of ancient characters suffer from 1) extremely severe and unique noise caused by three thousand years of burial and aging and 2) dramatically varying writing styles among ancient Chinese writers, both of which make them realistic for machine learning research. The dataset is freely available at https://github.com/wm-bupt/oracle-mnist.
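Because the abstract states that Oracle-MNIST uses the same data format as MNIST, a generic reader for MNIST's IDX container should be enough to load it. The following is a minimal sketch under that assumption, not the authors' loader. The function names `parse_idx` and `load_idx_gz` are hypothetical, and the file names in the usage comment are MNIST's conventional ones, assumed here rather than confirmed from the repository.

```python
import gzip
import struct

import numpy as np


def parse_idx(raw: bytes) -> np.ndarray:
    """Parse an IDX byte string (the MNIST container format) into an array."""
    # Header: two zero bytes, a one-byte type code, a one-byte dimension count.
    zero, dtype_code, ndim = struct.unpack(">HBB", raw[:4])
    if zero != 0 or dtype_code != 0x08:
        raise ValueError("expected an IDX file holding unsigned bytes")
    # Each dimension size is a big-endian unsigned 32-bit integer.
    shape = struct.unpack(">" + "I" * ndim, raw[4:4 + 4 * ndim])
    # The remaining bytes are the array contents in row-major order.
    data = np.frombuffer(raw, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)


def load_idx_gz(path: str) -> np.ndarray:
    """Read a gzip-compressed IDX file from disk."""
    with gzip.open(path, "rb") as f:
        return parse_idx(f.read())


# Usage (paths are assumptions; adjust to wherever the files were downloaded):
# images = load_idx_gz("train-images-idx3-ubyte.gz")  # expected shape (27222, 28, 28)
# labels = load_idx_gz("train-labels-idx1-ubyte.gz")  # expected shape (27222,)
```

If the files load with these shapes, the training set (27,222 images) and the test set (300 images for each of 10 classes, so 3,000) together account for the 30,222 images reported above.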

Paper Structure

This paper contains 6 sections, 3 figures, and 5 tables.

Introduction
Oracle-MNIST Dataset
Discovery of Oracle Characters
Details of Dataset
Experiments
Conclusions

Figures (3)

Figure 1: Oracle characters are the oldest hieroglyphs in China, inscribed on (a) oracle bones about 3,000 years ago. (b) Despite their pictorial nature, oracle characters constitute a fully functional and well-developed writing system.
Figure 2: (a) Example of scanned oracle inscription. (b-d) Examples of scanned oracle characters. Different writing styles lead to a high degree of intra-class variance and inter-class similarity.
Figure 3: Diagram of the conversion process used to generate the Oracle-MNIST dataset. Two examples, from the 'sun' and 'not' categories respectively, are depicted.
