Leveraging Semantic Segmentation Masks with Embeddings for Fine-Grained Form Classification

Taylor Archibald, Tony Martinez

TL;DR

This work tackles unsupervised, fine-grained form classification of historical documents, where form types are distinguished by subtle layout differences rather than by content. It introduces a pipeline that combines semantic segmentation masks with embeddings from ResNet, CLIP, DiT, and MAE to emphasize document structure. Two new datasets, the French 19th-century Census and the U.S. 1950 Census, benchmark the approach. The results show that segmentation improves clustering and classification across models, and that MAE also benefits from being trained on segmented inputs. The study establishes a new benchmark for unsupervised fine-grained document classification and points to future work on integrating additional self-supervised methods and representation interpolations.

Abstract

Efficient categorization of historical documents is crucial for fields such as genealogy, legal research, and historical scholarship, where manual classification is impractical for large collections due to its labor-intensive and error-prone nature. To address this, we propose a representational learning strategy that integrates semantic segmentation and deep learning models such as ResNet, CLIP, Document Image Transformer (DiT), and masked auto-encoders (MAE), to generate embeddings that capture document features without predefined labels. To the best of our knowledge, we are the first to evaluate embeddings on fine-grained, unsupervised form classification. To improve these embeddings, we propose to first employ semantic segmentation as a preprocessing step. We contribute two novel datasets, the French 19th-century and U.S. 1950 Census records, to demonstrate our approach. Our results show the effectiveness of these various embedding techniques in distinguishing similar document types and indicate that applying semantic segmentation can greatly improve clustering and classification results. The census datasets are available at https://github.com/tahlor/census_forms
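
The segmentation preprocessing step mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: it assumes a per-pixel class map is already available from a segmentation model, that the class ids below are a hypothetical labeling, and that "emphasizing structure" means keeping preprinted text and grid lines while whiting out everything else. The function name `keep_structure` is likewise made up for illustration.

```python
# Hypothetical class ids; the label scheme of a real segmentation model may differ.
BACKGROUND, HANDWRITING, PREPRINTED, GRID_LINE = 0, 1, 2, 3


def keep_structure(gray, classes, keep=(PREPRINTED, GRID_LINE), fill=255):
    """Return a copy of `gray` in which pixels outside the `keep` classes are set to `fill`.

    gray    -- 2D list of grayscale values (0-255)
    classes -- 2D list of per-pixel class ids with the same shape as `gray`
    """
    if len(gray) != len(classes):
        raise ValueError("image and class map must have the same height")
    out = []
    for img_row, cls_row in zip(gray, classes):
        if len(img_row) != len(cls_row):
            raise ValueError("image and class map rows must have the same width")
        out.append([px if c in keep else fill for px, c in zip(img_row, cls_row)])
    return out


# Toy 3x4 "form": a preprinted label, a handwritten entry, and a grid line.
gray = [
    [250, 30, 30, 250],
    [250, 90, 95, 250],
    [20, 20, 20, 20],
]
classes = [
    [BACKGROUND, PREPRINTED, PREPRINTED, BACKGROUND],
    [BACKGROUND, HANDWRITING, HANDWRITING, BACKGROUND],
    [GRID_LINE, GRID_LINE, GRID_LINE, GRID_LINE],
]

structure_only = keep_structure(gray, classes)
for row in structure_only:
    print(row)
```

In this toy example the handwritten row is blanked while the preprinted label and the grid line survive, so an embedding model applied to `structure_only` would see only the layout cues. Which classes to keep is a design choice assumed here, not something the abstract specifies.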

Paper Structure (13 sections, 4 figures, 4 tables)

Figures (4)

  • Figure 1: An image of a U.S. 1950 Census form can be decomposed into different content classes using a model trained on DELINE8K. The handwriting has been tinted by class (handwriting=red, preprinted text=green, and grid lines=blue).
  • Figure 2: U.S. 1950 Census forms contain identical content with varying layouts, challenging language-centric models like CLIP to detect subtle differences.
  • Figure 3: French Census form (left) and its masked counterpart (right) from a model trained on the DELINE8K dataset.
  • Figure 4: Projection of the U.S. 1950 Census dataset using UMAP based on embeddings from CLIP-ViT-L/14-336 (left), ViT-MAE-448 (middle), and ViT-MAE-448 trained on segmented images (right).
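
As a rough illustration of the class tinting described in the Figure 1 caption, the sketch below colors binary class masks red, green, and blue. The colors follow the caption; the dictionary keys, the function name `tint_by_class`, and the rule that later classes overwrite earlier ones on overlap are assumptions made for this example only.

```python
# Colors follow the Figure 1 caption; the key names are hypothetical.
CLASS_COLORS = {
    "handwriting": (255, 0, 0),
    "preprinted_text": (0, 255, 0),
    "grid_lines": (0, 0, 255),
}


def tint_by_class(masks, height, width, background=(255, 255, 255)):
    """Build an RGB image (2D list of tuples) from per-class binary masks.

    masks -- dict mapping a class name to a 2D list of 0/1 values.
    Where masks overlap, the class listed later in CLASS_COLORS wins
    (an assumption of this sketch).
    """
    tinted = [[background] * width for _ in range(height)]
    for name, color in CLASS_COLORS.items():
        mask = masks.get(name)
        if mask is None:
            continue  # class not predicted for this image
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    tinted[y][x] = color
    return tinted


# Toy 2x2 example with one pixel of preprinted text, one of handwriting,
# and a grid line across the bottom row.
masks = {
    "handwriting": [[0, 1], [0, 0]],
    "preprinted_text": [[1, 0], [0, 0]],
    "grid_lines": [[0, 0], [1, 1]],
}
tinted = tint_by_class(masks, height=2, width=2)
```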