Leveraging LLMs and attention-mechanism for automatic annotation of historical maps

Yunshuang Yuan, Monika Sester

TL;DR

This work tackles the challenge of scalable, automatic annotation of historical maps by distilling knowledge from large language models (LLMs) into an attention-based image classifier. Large patches are labeled by an LLM, and an Encoder–Drop Token–Cross-Attention framework learns to classify patches while producing attention maps that localize foreground objects; these maps are iteratively refined to provide higher-resolution annotations. Quantitatively, the method achieves high recall (>90%) and competitive IoU and precision (e.g., IoU for Wood ≈ 0.842 and Settlement ≈ 0.720 with precision ≈ 0.871 and 0.795) without requiring fine-grained manual labels during training. This approach enables scalable, semi-automatic semantic labeling of historical maps, with potential downstream benefits for pixel-level segmentation and scene-graph construction, while also highlighting areas for improvement such as boundary delineation and LLM-label accuracy.

Abstract

Historical maps are essential resources that provide insights into the geographical landscapes of the past. They serve as valuable tools for researchers across disciplines such as history, geography, and urban studies, facilitating the reconstruction of historical environments and the analysis of spatial transformations over time. However, when constrained to analogue or scanned formats, their interpretation is limited to humans and therefore not scalable. Recent advancements in machine learning, particularly in computer vision and large language models (LLMs), have opened new avenues for automating the recognition and classification of features and objects in historical maps. In this paper, we propose a novel distillation method that leverages LLMs and attention mechanisms for the automatic annotation of historical maps. LLMs are employed to generate coarse classification labels for low-resolution historical image patches, while attention mechanisms are utilized to refine these labels to higher resolutions. Experimental results demonstrate that the refined labels achieve a high recall of more than 90%. Additionally, the intersection over union (IoU) scores (84.2% for Wood and 72.0% for Settlement), along with precision scores of 87.1% and 79.5%, respectively, indicate that most labels are well-aligned with ground-truth annotations. Notably, these results were achieved without the use of fine-grained manual labels during training, underscoring the potential of our approach for efficient and scalable historical map analysis.
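For readers less familiar with the three metrics quoted in the abstract, the following minimal sketch shows how IoU, precision, and recall are commonly computed for a pair of binary masks. The function name `mask_metrics` and the tiny example masks are illustrative assumptions, not the authors' evaluation code or data.

```python
def mask_metrics(pred, truth):
    """Return (iou, precision, recall) for two equally sized binary masks.

    Both masks are lists of rows containing 0/1 values. This is a generic
    textbook definition, not the paper's evaluation script.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1          # predicted foreground, truly foreground
            elif p and not t:
                fp += 1          # predicted foreground, truly background
            elif t and not p:
                fn += 1          # missed foreground
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return iou, precision, recall


# Hypothetical 2x3 masks, for illustration only.
pred = [[1, 1, 0],
        [1, 0, 0]]
truth = [[1, 0, 0],
         [1, 1, 0]]
iou, precision, recall = mask_metrics(pred, truth)
print(iou, precision, recall)
```

Recall measures how much of the true foreground the labels cover, while precision measures how much of the labeled area is actually foreground; IoU penalizes both kinds of error at once.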

Paper Structure

This paper contains 10 sections, 1 equation, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Prompt image for LLM.
  • Figure 2: LLM prompting on an example of a historical map.
  • Figure 3: Framework for attention-based image classification. The input image features are first extracted by the Encoder into image tokens, with a subset discarded by the Drop Token module. The remaining tokens are processed by the Cross-Attention module to produce the final binary classification result. Post-training, the learned attention weights from the Cross-Attention module are used to generate the Attention Map.
  • Figure 4: An example of attention map generation with 16 image tokens. The final attention map is generated with 16 forward runs of the trained model. Each column indicates one round. The white squares in the first row indicate that the corresponding token is dropped (features are set to zeros). The red squares in the second row show the selected maximum attention weight in each forward round, eventually composing the Attention Map.
  • Figure 5: Example of attention maps overlaid on a $5\times 5$ grid of input images. Each image is covered by a $6\times 6$ grid of attention weights.
  • ...and 4 more figures
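
The round-by-round procedure described in the Figure 4 caption can be sketched as follows. This is one plausible reading of the caption, not the authors' implementation: it assumes that the token selected in each round is the one dropped (zeroed) in later rounds, and that the maximum is taken over tokens not yet dropped. The function `toy_attention` is a hypothetical stand-in for the trained Cross-Attention module, and the per-token scores are made up.

```python
import math


def toy_attention(token_scores, dropped):
    """Hypothetical stand-in for the trained model's attention weights.

    Dropped tokens have their features set to zero, modelled here as a
    logit of 0.0; a softmax is then taken over all tokens.
    """
    logits = [0.0 if i in dropped else s for i, s in enumerate(token_scores)]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def build_attention_map(token_scores, attention_fn=toy_attention):
    """Compose an attention map with one forward run per token."""
    n = len(token_scores)
    dropped = set()
    attention_map = [0.0] * n
    order = []
    for _ in range(n):
        weights = attention_fn(token_scores, dropped)
        kept = [i for i in range(n) if i not in dropped]
        # Assumption: select the strongest token among those still kept.
        best = max(kept, key=lambda i: weights[i])
        attention_map[best] = weights[best]
        order.append(best)
        dropped.add(best)  # zero this token in all later rounds
    return attention_map, order


# Four made-up token scores; the figure itself uses 16 tokens.
attention_map, order = build_attention_map([2.0, 0.5, 1.0, 3.0])
print(order)
print(attention_map)
```

With 16 tokens the loop would run 16 times, matching the 16 forward runs mentioned in the caption. How the recorded weights are normalized before being shown as a map is not stated in this excerpt, so the sketch leaves them as returned by the stand-in function.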