Automatic Uncertainty-Aware Synthetic Data Bootstrapping for Historical Map Segmentation

Lukas Arzoumanidis, Julius Knechtel, Jan-Henrik Haunert, Youness Dehbi

TL;DR

This work tackles the data scarcity challenge in historical map analysis by bootstrapping synthetic maps that inherit cartographic style from a homogeneous map corpus and are grounded in real vector data. It combines unpaired style-transfer via CycleGAN and diffusion-based approaches (UNSB/Stable Diffusion) with explicit data-dependent uncertainty degradation to produce large, labeled training sets, then evaluates them through domain-adaptive semantic segmentation using Self-Constructing Graph Convolutional Networks. Key findings show that generative degradation strategies yield high realism (low FID) and strong segmentation performance on original maps, with DLCycleGAN offering the best overall metrics among the tested approaches. The results suggest a practical route to scalable, automated interpretation of vast historical map archives, potentially enabling widespread land-cover analysis without manual annotation.

Abstract

The automated analysis of historical documents, particularly maps, has benefited greatly from advances in deep learning and its success across various computer vision applications. However, most deep learning-based methods rely heavily on large amounts of annotated training data, which are typically unavailable for historical maps, especially for those belonging to specific, homogeneous cartographic domains, also known as corpora. Creating high-quality training data suitable for machine learning often takes a significant amount of time and involves extensive manual effort. While synthetic training data can alleviate the scarcity of real-world samples, it often lacks the affinity (realism) and diversity (variation) necessary for effective learning. By transferring the cartographic style of an original historical map corpus onto vector data, we bootstrap an effectively unlimited number of synthetic historical maps suitable for tasks such as land-cover interpretation of a homogeneous historical map corpus. We propose an automatic deep generative approach and an alternative manual stochastic degradation technique to emulate the visual uncertainty and noise, also known as data-dependent uncertainty, commonly observed in historical map scans. To quantitatively evaluate the effectiveness and applicability of our approach, the generated training datasets were employed for domain-adaptive semantic segmentation on a homogeneous map corpus using a Self-Constructing Graph Convolutional Network, enabling a comprehensive assessment of the impact of our data bootstrapping methods.
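To make the idea of a manual stochastic degradation more concrete, the following is a minimal sketch of what such a step might look like. It is not the authors' implementation: the function name `degrade_map`, all parameters, and the choice of additive Gaussian noise plus soft dark blotches (loosely standing in for scan grain and stains) are assumptions made purely for illustration. Only the image is altered, so the land-cover annotation paired with a synthetic map would remain valid.

```python
import numpy as np


def degrade_map(image, rng, noise_sigma=8.0, n_stains=3,
                max_radius=20, stain_strength=0.35):
    """Return a stochastically degraded copy of an H x W x 3 uint8 image.

    Hypothetical illustration only: parameter names and defaults are
    assumptions, not values taken from the paper.
    """
    h, w, _ = image.shape
    out = image.astype(np.float64)

    # Additive zero-mean Gaussian noise, a crude stand-in for scan grain.
    out += rng.normal(0.0, noise_sigma, size=out.shape)

    # Soft circular blotches that darken the image, a crude stand-in
    # for stains; position and radius are drawn at random.
    yy, xx = np.mgrid[0:h, 0:w]
    for _ in range(n_stains):
        cy = rng.integers(0, h)
        cx = rng.integers(0, w)
        radius = rng.integers(max_radius // 2, max_radius + 1)
        dist2 = (yy - cy) ** 2 + (xx - cx) ** 2
        mask = np.exp(-dist2 / (2.0 * radius ** 2))  # values in (0, 1]
        out *= (1.0 - stain_strength * mask)[..., None]

    # Clip back to the valid intensity range and restore the dtype.
    return np.clip(out, 0, 255).round().astype(np.uint8)


# Usage: degrade a flat light-grey tile with a seeded generator.
clean = np.full((64, 64, 3), 200, dtype=np.uint8)
degraded = degrade_map(clean, np.random.default_rng(0))
```

Because the randomness comes from a seeded `numpy.random.Generator`, each call with a fresh seed yields a different degraded variant of the same clean map, which is one simple way to add variation to a synthetic training set.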

Paper Structure

This paper contains 10 sections, 1 equation, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Generated synthetic historical maps (left) and results of semantic segmentation of original historical maps (right).
  • Figure 2: Workflow for automatic generation of training data, comprising bootstrapped historical maps and their corresponding bootstrapped land-cover class annotations.
  • Figure 3: Complications in the preservation of historical maps. Accumulation of dust and mildew stains as well as imprecisions in shading and coloring.
  • Figure 4: Bootstrapped map pair example comprising a synthetically generated historical map and simulated data-dependent uncertainty.
  • Figure 5: Generator-Discriminator interplay of the underlying CycleGAN for data-dependent uncertainty simulation in style-transferred historical maps.
  • ...and 5 more figures