CENSUS-HWR: a large training dataset for offline handwriting recognition

Chetan Joshi, Lawry Sorenson, Ammon Wolfert, Mark Clement, Joseph Price, Kasey Buckles

TL;DR

This paper introduces CENSUS-HWR, a large-scale, natural offline handwriting dataset derived from US census forms (1930 and 1940), with about 1.8 million word images, 1.86 million word instances, a 10,711-word vocabulary, and roughly 70,000 authors. It describes a pipeline that extracts and segments handwritten words by matching scanned pages to form templates with SIFT and RANSAC, followed by a crowd-sourced labeling workflow that includes reverse indexing to improve transcription accuracy. The authors provide and evaluate a Bluche-style gated convolutional model with bidirectional RNNs and CTC loss; it achieves a mean character error rate of approximately 4.65% in cross-validation, and its weights are available for download. The dataset emphasizes natural handwriting noise and diversity, with the aim of enabling robust, real-world HWR models and more rigorous evaluation than traditional, clean datasets allow.
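
To make the reported metric concrete, the following is a minimal sketch of how a mean character error rate can be computed from reference and predicted transcriptions. It assumes CER is the Levenshtein edit distance divided by the reference length, averaged per word; the summary does not state which averaging the paper uses, and the function names and sample words here are purely illustrative.

```python
from typing import Iterable, Tuple


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def mean_cer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average per-word CER over (reference, hypothesis) pairs.

    Assumption: per-word averaging; empty references are skipped.
    """
    rates = [edit_distance(ref, hyp) / len(ref) for ref, hyp in pairs if ref]
    if not rates:
        raise ValueError("no non-empty reference strings")
    return sum(rates) / len(rates)


# Hypothetical transcriptions, not taken from the dataset.
sample = [("farmer", "farmer"), ("laborer", "laborar"), ("teacher", "teachr")]
print(f"mean CER: {mean_cer(sample):.4f}")
```

Under this definition, a mean CER of 4.65% corresponds to roughly one character edit per 21 to 22 reference characters.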

Abstract

Progress in Automated Handwriting Recognition has been hampered by the lack of large training datasets. Nearly all research uses a set of small datasets that often cause models to overfit. We present CENSUS-HWR, a new dataset consisting of full English handwritten words in 1,812,014 grayscale images. The collection contains a total of 1,865,134 handwritten texts drawn from a vocabulary of 10,711 English words. The dataset is intended to serve as a benchmark for deep learning handwriting recognition models. It has been extracted from the US 1930 and 1940 censuses, each taken by approximately 70,000 enumerators. The dataset and the trained model, with its weights, are freely available to download at https://censustree.org/data.html.

Paper Structure

This paper contains 8 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: A bar graph of word count comparing the CENSUS-HWR (BYUHWR) dataset with the IAM and RIMES datasets.
  • Figure 2: A bar graph of vocabulary size comparing the CENSUS-HWR (BYUHWR) dataset with the IAM and RIMES datasets.
  • Figure 3: A bar graph of the number of authors comparing the CENSUS-HWR (BYUHWR) dataset with the IAM and RIMES datasets.
  • Figure 4: Handwriting data samples from the census images.
  • Figure 5: Example of a US 1930 census form.
  • ...and 2 more figures