MapSAM: Adapting Segment Anything Model for Automated Feature Detection in Historical Maps

Xue Xia, Daiwei Zhang, Wenxuan Song, Wei Huang, Lorenz Hurni

TL;DR

The proposed MapSAM framework demonstrates promising performance across three distinct historical map segmentation tasks: railway, vineyard, and building block detection. Experimental results show that it adapts well to various features, even when fine-tuned with extremely limited data.

Abstract

Automated feature detection in historical maps can significantly accelerate the reconstruction of the geospatial past. However, this process is often constrained by the time-consuming task of manually digitizing sufficient high-quality training data. The emergence of visual foundation models, such as the Segment Anything Model (SAM), offers a promising solution due to their remarkable generalization capabilities and rapid adaptation to new data distributions. Despite this, directly applying SAM in a zero-shot manner to historical map segmentation poses significant challenges, including poor recognition of certain geospatial features and a reliance on input prompts, which limits its ability to be fully automated. To address these challenges, we introduce MapSAM, a parameter-efficient fine-tuning strategy that adapts SAM into a prompt-free and versatile solution for various downstream historical map segmentation tasks. Specifically, we employ Weight-Decomposed Low-Rank Adaptation (DoRA) to integrate domain-specific knowledge into the image encoder. Additionally, we develop an automatic prompt generation process, eliminating the need for manual input. We further enhance the positional prompt in SAM, transforming it into a higher-level positional-semantic prompt, and modify the cross-attention mechanism in the mask decoder with masked attention for more effective feature aggregation. The proposed MapSAM framework demonstrates promising performance across two distinct historical map segmentation tasks: one focused on linear features and the other on areal features. Experimental results show that it adapts well to various features, even when fine-tuned with extremely limited data (e.g. 10 shots).
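
The DoRA step named in the abstract can be illustrated with a minimal NumPy sketch. It assumes the commonly used DoRA formulation, in which a frozen weight `W0` is split into a per-column magnitude `m` and a direction that receives a low-rank update `B @ A`. The function name `dora_weight`, the toy shapes, and the initialization are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def dora_weight(W0, m, B, A):
    """Recompose an adapted weight from a magnitude vector and a direction.

    W0: frozen pre-trained weight, shape (d, k)
    m:  trainable magnitude, shape (1, k), one scale per column
    B, A: trainable low-rank factors, shapes (d, r) and (r, k)
    """
    V = W0 + B @ A                                        # direction before normalization
    col_norm = np.linalg.norm(V, axis=0, keepdims=True)   # column-wise L2 norm, shape (1, k)
    return m * (V / col_norm)                             # unit-norm columns rescaled by m

# Toy sizes (assumed for illustration only).
rng = np.random.default_rng(0)
d, k, r = 8, 6, 2
W0 = rng.normal(size=(d, k))

# Typical initialization: m holds the column norms of W0 and B is zero,
# so the adapted weight starts out identical to the pre-trained one.
m = np.linalg.norm(W0, axis=0, keepdims=True)
B = np.zeros((d, r))
A = rng.normal(size=(r, k))

W_init = dora_weight(W0, m, B, A)
trainable = m.size + B.size + A.size   # parameters updated during fine-tuning
```

With these toy sizes the trainable parameters (`m`, `B`, `A`) are fewer than the entries of `W0`, and the gap widens as `d` and `k` grow relative to the rank `r`. This is the sense in which the adaptation is parameter-efficient.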

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Examples of using zero-shot SAM and MapSAM for feature segmentation in historical maps. Rows (a) and (b) depict the segmentation of railways (two thick parallel lines) and vineyards (a set of vertical strokes), respectively. Zero-shot SAM requires manual prompts—green and red points representing positive and negative point prompts, respectively—but fails to delineate clear boundaries for the target objects effectively. In contrast, MapSAM eliminates the need for manual intervention and significantly improves segmentation accuracy.
  • Figure 2: The overall framework of MapSAM. We insert trainable DoRA layers into the image encoder to incorporate domain-specific feature information. The proposed auto-prompt generator leverages multi-layer feature embeddings to generate positive-negative point prompts. These prompts are combined with the target object's embedding to form high-level positional-semantic prompts. Finally, the positional-semantic prompts interact with the image embedding in the mask decoder using modified masked attention to generate the final object mask.
  • Figure 3: Weight update mechanisms in (a) regular, (b) LoRA and (c) DoRA fine-tuning. In regular fine-tuning, the weight update matrix $\Delta W$ has the same dimensions as the pre-trained weight matrix. LoRA reduces the number of learnable parameters by approximating the weight update using two low-rank matrices, $B$ and $A$. DoRA further decomposes the weight update into magnitude and direction components, updating them separately: the magnitude matrix $m$ is trained directly, while the directional component $V$ is updated following the LoRA strategy.
  • Figure 4: The coarse mask generated by the auto-prompt generator, with the corresponding positive (green) and negative (red) point prompts, applied to input images for railway detection (left) and vineyard detection (right).
  • Figure 5: The modified masked-attention mask decoder in MapSAM. The coarse mask produced by the auto-prompt generator serves as the initial mask for token-to-image masked attention. As the process progresses through each decoder layer, this attention mask is iteratively refined by incorporating the updated image embeddings and prompt tokens, allowing for more precise modulation and enhanced accuracy.
  • ...and 2 more figures
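
The token-to-image masked attention described in the Figure 5 caption can be sketched generically as follows. This is a single-head NumPy illustration under stated assumptions: the coarse mask is taken to be a boolean grid over image positions, the function name `masked_cross_attention` and all shapes are hypothetical, and the fallback for an empty mask is an added safeguard rather than something the source describes. It is not the authors' decoder.

```python
import numpy as np

def masked_cross_attention(tokens, image_emb, attn_mask):
    """Single-head cross-attention restricted to masked image positions.

    tokens:    prompt tokens, shape (T, C)
    image_emb: flattened image embedding, shape (N, C)
    attn_mask: boolean, shape (T, N); True marks positions a token may attend to
    """
    scores = tokens @ image_emb.T / np.sqrt(tokens.shape[1])
    # Assumed safeguard: a token whose mask is empty attends to every position.
    empty = ~attn_mask.any(axis=1, keepdims=True)
    allowed = attn_mask | empty
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerically stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ image_emb, weights

# Toy example: a 4x4 embedding grid and a coarse mask covering its centre.
rng = np.random.default_rng(0)
T, C, H, W = 2, 8, 4, 4
tokens = rng.normal(size=(T, C))
image_emb = rng.normal(size=(H * W, C))

coarse = np.zeros((H, W), dtype=bool)
coarse[1:3, 1:3] = True
attn_mask = np.broadcast_to(coarse.reshape(1, -1), (T, H * W))

out, weights = masked_cross_attention(tokens, image_emb, attn_mask)
```

In an iterative decoder, the mask predicted by one layer would replace `attn_mask` for the next layer, which corresponds to the refinement the caption describes.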