Self-supervised Video Instance Segmentation Can Boost Geographic Entity Alignment in Historical Maps
Xue Xia, Randall Balestriero, Tao Zhang, Lorenz Hurni
TL;DR
This work addresses geographic entity alignment across historical maps with a self-supervised video instance segmentation (VIS) pipeline that unifies segmentation and linking by treating a sequence of maps as a single spatio-temporal volume of size $T \times H \times W$. It introduces a pretraining strategy that generates synthetic two-frame videos from unlabeled map images and uses them to pretrain a VIS model (Mask2Former-VIS) before fine-tuning on labeled historical-map videos. The approach yields substantial gains over training from scratch (a 24.9\% improvement in AP and a 0.23 increase in F1 score) and outperforms conventional two-step linking, with synthetic historical map videos providing a strong domain-relevant pretraining signal. The results demonstrate automated, distortion-robust geographic entity alignment with a reduced annotation burden, enabling scalable analysis of cultural heritage, urban development, and environmental change across historical maps.
Abstract
Tracking geographic entities, such as buildings, across historical maps offers valuable insights into cultural heritage, urbanization patterns, environmental changes, and other areas of historical research. However, linking these entities across diverse maps remains a persistent challenge. Traditionally, this has been addressed through a two-step process: detecting entities within individual maps and then associating them via a heuristic-based post-processing step. In this paper, we propose a novel approach that combines segmentation and association of geographic entities in historical maps using video instance segmentation (VIS). This method significantly streamlines geographic entity alignment and enhances automation. Acquiring high-quality, video-format training data for VIS models is prohibitively expensive, though, especially for historical maps, which often contain hundreds or thousands of geographic entities. To mitigate this challenge, we explore self-supervised learning (SSL) techniques to enhance VIS performance on historical maps. We evaluate the performance of VIS models under different pretraining configurations and introduce a novel method for generating synthetic videos from unlabeled historical map images for pretraining. Our proposed self-supervised VIS method substantially reduces the need for manual annotation. Experimental results demonstrate the superiority of the proposed self-supervised VIS approach, which achieves a 24.9\% improvement in AP and a 0.23 increase in F1 score compared to the model trained from scratch.
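
To make the idea of a synthetic two-frame video concrete, the following is a minimal sketch, not the generation method used in the paper. It assumes that the second frame is obtained by a random integer translation of the first, and that a pseudo-instance label map is already available for the unlabeled image (for example from some unsupervised segmenter). The function names `shift2d` and `make_two_frame_video` and the parameter `max_shift` are hypothetical. The point it illustrates is that, once frames are stacked along a time axis, an entity keeps the same ID in every frame, so segmentation and linking are expressed in one $T \times H \times W$ label volume.

```python
import numpy as np


def shift2d(arr, dy, dx, fill=0):
    """Translate a (H, W, ...) array by (dy, dx) pixels, padding with `fill`."""
    out = np.full_like(arr, fill)
    h, w = arr.shape[:2]
    # Source and destination windows that stay inside the image bounds.
    ys_src = slice(max(0, -dy), min(h, h - dy))
    xs_src = slice(max(0, -dx), min(w, w - dx))
    ys_dst = slice(max(0, dy), min(h, h + dy))
    xs_dst = slice(max(0, dx), min(w, w + dx))
    out[ys_dst, xs_dst] = arr[ys_src, xs_src]
    return out


def make_two_frame_video(image, instance_ids, max_shift=8, rng=None):
    """Build a synthetic 2-frame clip from one image (illustrative assumption).

    image:        (H, W, C) array, the unlabeled map tile.
    instance_ids: (H, W) integer array, 0 = background, k > 0 = pseudo-instance k.
    Returns frames (2, H, W, C), masks (2, H, W), and the applied (dy, dx).
    """
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = (int(v) for v in rng.integers(-max_shift, max_shift + 1, size=2))
    # Frame 0 is the original; frame 1 is the translated copy.
    frames = np.stack([image, shift2d(image, dy, dx)])
    # The same translation is applied to the labels, so each ID is linked
    # across the two frames by construction.
    masks = np.stack([instance_ids, shift2d(instance_ids, dy, dx)])
    return frames, masks, (dy, dx)
```

A real pipeline would likely use richer transformations (scaling, rotation, style changes) to mimic the distortions between map editions; pure translation is used here only to keep the sketch short and easy to check.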
