MapSAM2: Adapting SAM2 for Automatic Segmentation of Historical Map Images and Time Series

Xue Xia, Randall Balestriero, Tao Zhang, Yixin Zhou, Andrew Ding, Dev Saini, Lorenz Hurni

TL;DR

MapSAM2 tackles automatic segmentation of historical map images and their time series by using SAM2 as a video-based model. It combines a LoRA-adapted image encoder, a self-sorting memory bank, and a memory-attention mechanism to process sets of tiles as pseudo-videos and time series as actual videos, enabling end-to-end segmentation and linking with limited supervision. A YOLO-based prompting pipeline and a pseudo-video generation strategy reduce annotation costs, while the Siegfried Building Time Series Dataset and accompanying pseudo datasets serve as benchmarks. The approach performs strongly, especially on areal features, and shows robust few-shot capability for both image segmentation and time-series linking, with practical applications in dating buildings and analyzing changes in historical road networks. The authors state that code and data will be released to support reproducibility.

Abstract

Historical maps are unique and valuable archives that document geographic features across different time periods. However, automated analysis of historical map images remains a significant challenge due to their wide stylistic variability and the scarcity of annotated training data. Constructing linked spatio-temporal datasets from historical map time series is even more time-consuming and labor-intensive, as it requires synthesizing information from multiple maps. Such datasets are essential for applications such as dating buildings, analyzing the development of road networks and settlements, and studying environmental changes. We present MapSAM2, a unified framework for automatically segmenting both historical map images and time series. Built on a visual foundation model, MapSAM2 adapts to diverse segmentation tasks with few-shot fine-tuning. Our key innovation is to treat both historical map images and time series as videos. For images, we process a set of tiles as a video, enabling the memory attention mechanism to incorporate contextual cues from similar tiles, leading to improved geometric accuracy, particularly for areal features. For time series, we introduce the annotated Siegfried Building Time Series Dataset and, to reduce annotation costs, propose generating pseudo time series from single-year maps by simulating common temporal transformations. Experimental results show that MapSAM2 learns temporal associations effectively and can accurately segment and link buildings in time series under limited supervision or using pseudo videos. We will release both our dataset and code to support future research.
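
To make the idea of pseudo time series concrete, the following is a minimal sketch and not the authors' implementation. It assumes building footprints are stored as vertex tuples and simulates only two transformations, shift and disappearance. The `Building` class, the `track_id` field, and the parameters `max_shift` and `p_disappear` are hypothetical names introduced for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Building:
    track_id: int   # hypothetical identity kept across frames, usable as a link label
    polygon: tuple  # footprint vertices as ((x, y), ...)


def shift(building, dx, dy):
    """Translate a footprint by (dx, dy)."""
    moved = tuple((x + dx, y + dy) for x, y in building.polygon)
    return Building(building.track_id, moved)


def make_pseudo_series(buildings, n_frames, rng, max_shift=2.0, p_disappear=0.1):
    """Derive n_frames frames from one single-year map (frame 0 is the original).

    Assumption: each later frame is built from the previous one by randomly
    dropping buildings and jittering the positions of the survivors.
    """
    frames = [list(buildings)]
    for _ in range(n_frames - 1):
        next_frame = []
        for b in frames[-1]:
            if rng.random() < p_disappear:
                continue  # the building is absent from this frame onward
            dx = rng.uniform(-max_shift, max_shift)
            dy = rng.uniform(-max_shift, max_shift)
            next_frame.append(shift(b, dx, dy))
        frames.append(next_frame)
    return frames


if __name__ == "__main__":
    square = ((0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0))
    series = make_pseudo_series([Building(i, square) for i in range(5)], 4, random.Random(0))
    print([len(frame) for frame in series])
```

Because every surviving footprint keeps its `track_id`, the links between frames come for free, which is what makes such synthetic sequences cheap to label. Reading the frames in reverse order turns disappearances into appearances. Shape change and merge are left out of this sketch because they need polygon operations.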

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Segmentation capabilities of MapSAM2. MapSAM2 supports (a) instance-level segmentation and linking for historical map time series and (b) semantic segmentation for historical map images.
  • Figure 2: The MapSAM2 architecture. We propose treating both historical map time series and sets of map images as videos to enable memory-enhanced historical map segmentation. For time series data, YOLO is used to provide bounding box prompts, and a first-in, first-out (FIFO) strategy is applied to build the memory bank using the $k$ most recent frames for memory attention. For images, no external prompts are provided; instead, the memory bank is constructed based on confidence and dissimilarity, followed by weighted sampling to select the $k$ most relevant frames for memory attention. In the figure, solid arrows indicate operations common to both data types, blue dashed arrows denote operations specific to time series, and orange dashed arrows denote operations specific to images.
  • Figure 3: Generating pseudo time series by transforming single-year maps. The applied transformations are highlighted with bounding boxes in the examples: (a) shift, (b) appearance and disappearance, and (c) shape change and merge.
  • Figure 4: Image segmentation results from U-Net, MapSAM, and MapSAM2, each trained on 10-shot samples to detect railways, vineyards, and building blocks.
  • Figure 5: Video segmentation results under 10-shot training on the real Siegfried Building Time Series: (a) Mask R-CNN with linking, (b) Mask2Former-VIS, (c) MapSAM2 prompted by YOLO trained on the same 10-shot data, and (d) MapSAM2 prompted by YOLO trained on the full dataset. The YOLO prompt is provided only for the latest frame and is shown as a green bounding box. A challenging case, where two small buildings merge into a larger structure over time, is highlighted with a circle (green indicates successful video segmentation, red indicates failure). Links are indicated with arrows: solid arrows denote correct links, while dashed arrows denote incorrect links.
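
The two memory-bank strategies named in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the caption does not give the admission rule or the sampling weights, so the thresholds `min_conf` and `max_sim`, the use of cosine similarity, and the weight "similarity to the query times confidence" are all hypothetical choices. The class and function names are likewise invented for this example.

```python
import math
import random
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def weighted_sample(items, weights, k, rng):
    """Sample up to k items without replacement (Efraimidis-Spirakis keys)."""
    keyed = [(rng.random() ** (1.0 / w), item)
             for item, w in zip(items, weights) if w > 0]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]


class FifoMemory:
    """Time series: keep only the k most recent frames."""

    def __init__(self, k):
        self.frames = deque(maxlen=k)  # the oldest frame is evicted automatically

    def add(self, frame):
        self.frames.append(frame)

    def select(self):
        return list(self.frames)


class SelectiveMemory:
    """Image tiles: admit confident, mutually dissimilar tiles, then sample k."""

    def __init__(self, k, min_conf=0.8, max_sim=0.95):  # assumed thresholds
        self.k, self.min_conf, self.max_sim = k, min_conf, max_sim
        self.entries = []  # list of (feature, confidence)

    def add(self, feature, confidence):
        if confidence < self.min_conf:
            return False  # prediction too uncertain to serve as memory
        if any(cosine(feature, f) > self.max_sim for f, _ in self.entries):
            return False  # near-duplicate of a stored tile
        self.entries.append((feature, confidence))
        return True

    def select(self, query, rng):
        # Assumed relevance weight: similarity to the query tile times confidence.
        weights = [max(cosine(query, f), 0.0) * c for f, c in self.entries]
        return weighted_sample(self.entries, weights, self.k, rng)
```

A `deque` with `maxlen=k` gives the FIFO behaviour directly. For tiles there is no temporal order, which is why this sketch ranks stored entries by relevance to the current tile rather than by recency.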