Few-Shot Segmentation of Historical Maps via Linear Probing of Vision Foundation Models

Rafael Sterzinger, Marco Peer, Robert Sablatnig

TL;DR

The paper tackles few-shot semantic segmentation of historical maps, an area challenged by extreme visual variability and limited annotations. It proposes a simple yet effective three-stage framework that leverages semantic embeddings from vision foundation models (e.g., SAM, DINOv2, RADIO), a linear pixel-wise probing head, and parameter-efficient fine-tuning via low-rank adaptations (LoRA/DoRA). The method achieves state-of-the-art performance on Siegfried for railway and vineyard segmentation in 10-shot and 5-shot settings and attains a mean PQ of 67.3% on ICDAR 2021 building-block segmentation, all while using only about 689k trainable parameters (0.21% of the full model). These results demonstrate strong generalization in ultra-low data regimes and offer a practical pathway for scalable historical map digitization, with code publicly available for replication and extension.
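The probing and adaptation stages can be illustrated with a minimal sketch. A linear pixel-wise probe applies one shared affine map to every pixel's embedding, and a low-rank adapter trains two thin matrices in place of a full weight matrix. All names, dimensions, and values below are hypothetical choices for illustration and are not taken from the authors' implementation.

```python
# Illustrative sketch only: names, dimensions, and values are hypothetical,
# not taken from the paper's code.

def linear_probe(features, weights, bias):
    """Classify every pixel with one shared linear map.

    features: H x W grid of D-dimensional embeddings (nested lists)
    weights:  C x D matrix; bias: length-C vector
    returns:  H x W grid of predicted class indices
    """
    labels = []
    for row in features:
        out_row = []
        for emb in row:
            logits = [
                sum(w * x for w, x in zip(w_c, emb)) + b
                for w_c, b in zip(weights, bias)
            ]
            # argmax over the class logits
            out_row.append(max(range(len(logits)), key=logits.__getitem__))
        labels.append(out_row)
    return labels


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters of one low-rank adapter.

    The frozen d_out x d_in weight is left untouched; only
    A (rank x d_in) and B (d_out x rank) are trained.
    """
    return rank * (d_in + d_out)


# Toy 2x2 "feature map" with 3-dimensional embeddings and 2 classes.
features = [[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
            [[0.0, 0.0, 1.0], [2.0, 1.0, 0.0]]]
weights = [[1.0, -1.0, 0.0],   # class 0
           [-1.0, 1.0, 0.5]]   # class 1
bias = [0.0, 0.0]

prediction = linear_probe(features, weights, bias)

# Hypothetical 1280 x 1280 projection adapted with rank 4.
adapter_params = lora_trainable_params(1280, 1280, 4)
full_params = 1280 * 1280
```

In this toy setting the adapter trains 10,240 values where the full matrix has 1,638,400, which shows how low-rank adaptation keeps the trainable share of a large backbone small.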

Abstract

As rich sources of history, maps provide crucial insights into historical changes, yet their diverse visual representations and limited annotated data pose significant challenges for automated processing. We propose a simple yet effective approach for few-shot segmentation of historical maps, leveraging the rich semantic embeddings of large vision foundation models combined with parameter-efficient fine-tuning. Our method outperforms the state-of-the-art on the Siegfried benchmark dataset in vineyard and railway segmentation, achieving +5% and +13% relative improvements in mIoU in 10-shot scenarios and around +20% in the more challenging 5-shot setting. Additionally, it demonstrates strong performance on the ICDAR 2021 competition dataset, attaining a mean PQ of 67.3% for building block segmentation, despite not being optimized for this shape-sensitive metric, underscoring its generalizability. Notably, our approach maintains high performance even in extremely low-data regimes (10- & 5-shot), while requiring only 689k trainable parameters - just 0.21% of the total model size. Our approach enables precise segmentation of diverse historical maps while drastically reducing the need for manual annotations, advancing automated processing and analysis in the field. Our implementation is publicly available at: https://github.com/RafaelSterzinger/few-shot-map-segmentation.
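Two of the figures above lend themselves to a quick check: the total model size implied by the reported parameter share, and what a relative (as opposed to absolute) mIoU improvement means. The sketch below uses only the rounded numbers from the abstract, so the implied total is approximate; the baseline and improved mIoU values in the example are hypothetical and not results from the paper.

```python
# Back-of-the-envelope arithmetic from the rounded figures in the abstract.
trainable_params = 689_000      # reported trainable parameters
trainable_fraction = 0.0021     # reported share of the total model (0.21%)

# Implied total model size, approximate because both inputs are rounded.
implied_total = trainable_params / trainable_fraction


def relative_improvement(baseline, ours):
    """Relative gain of `ours` over `baseline`, e.g. 0.05 for +5%."""
    return (ours - baseline) / baseline


# Hypothetical mIoU values, chosen only to illustrate a +5% relative gain.
example_gain = relative_improvement(0.60, 0.63)

print(f"implied total: {implied_total / 1e6:.1f}M parameters")
print(f"example relative gain: {example_gain:.1%}")
```

The implied total comes out at roughly 328M parameters. A relative gain is measured against the baseline score, so +5% relative on a baseline of 0.60 corresponds to an absolute gain of 0.03.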

Paper Structure

This paper contains 22 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: An excerpt of historical city maps from around the world, published between 1720 and 1950. © Petitpierre et al.
  • Figure 2: An illustration of the first three principal components of RADIO-H (Ranzinger et al., 2024) feature embeddings of a map of Paris. Despite no prior training in this specialized domain, meaningful classes have already emerged: landmarks, building blocks, streets, and street names are clearly distinguishable.
  • Figure 3: A visualization of the first three principal components of feature embeddings from vision foundation models: subjectively speaking, RADIO exhibits the strongest spatial features, followed by DINOv2 and, lastly, SAM.
  • Figure 4: Analyzing the impact of resolution: RADIO is on par with or better than SAM, while being computationally more efficient (Ranzinger et al., 2024).
  • Figure 5: Examples from the Siegfried dataset (Xia et al., 2024), showing the input map, the ground truth, and the predictions at 10 shots, 5 shots, and 1 shot.