Leveraging LLMs and attention-mechanism for automatic annotation of historical maps
Yunshuang Yuan, Monika Sester
TL;DR
This work tackles the challenge of scalable, automatic annotation of historical maps by distilling knowledge from large language models (LLMs) into an attention-based image classifier. Large patches are labeled by an LLM, and an Encoder–Drop Token–Cross-Attention framework learns to classify patches while producing attention maps that localize foreground objects; these maps are iteratively refined to provide higher-resolution annotations. Quantitatively, the refined labels reach a recall above 90%, with IoU of 0.842 for Wood and 0.720 for Settlement and precision of 0.871 and 0.795, respectively, without requiring fine-grained manual labels during training. This approach enables scalable, semi-automatic semantic labeling of historical maps, with potential downstream benefits for pixel-level segmentation and scene-graph construction, while also highlighting areas for improvement such as boundary delineation and LLM-label accuracy.
Abstract
Historical maps are essential resources that provide insights into the geographical landscapes of the past. They serve as valuable tools for researchers across disciplines such as history, geography, and urban studies, facilitating the reconstruction of historical environments and the analysis of spatial transformations over time. However, as long as they exist only in analogue or scanned form, they can be interpreted only by humans, which does not scale. Recent advancements in machine learning, particularly in computer vision and large language models (LLMs), have opened new avenues for automating the recognition and classification of features and objects in historical maps. In this paper, we propose a novel distillation method that leverages LLMs and attention mechanisms for the automatic annotation of historical maps. LLMs are employed to generate coarse classification labels for low-resolution historical image patches, while attention mechanisms are utilized to refine these labels to higher resolutions. Experimental results demonstrate that the refined labels achieve a high recall of more than 90%. Additionally, the intersection over union (IoU) scores (84.2% for Wood and 72.0% for Settlement), along with precision scores of 87.1% and 79.5%, respectively, indicate that most labels are well-aligned with ground-truth annotations. Notably, these results were achieved without the use of fine-grained manual labels during training, underscoring the potential of our approach for efficient and scalable historical map analysis.
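For readers less familiar with the reported metrics, the following is a minimal sketch of how recall, precision, and IoU can be computed for a single class from a predicted binary mask and a ground-truth mask. The function name `mask_metrics` and the toy masks are illustrative assumptions, not the authors' evaluation code or data; the paper's actual evaluation protocol may differ.

```python
def mask_metrics(pred, truth):
    """Return (iou, precision, recall) for two same-shaped binary masks.

    Masks are nested lists of 0/1 values. This is a hypothetical helper
    for illustration, not the evaluation code used in the paper.
    """
    tp = fp = fn = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            if p and t:
                tp += 1      # predicted foreground, truly foreground
            elif p and not t:
                fp += 1      # predicted foreground, truly background
            elif t and not p:
                fn += 1      # missed foreground
    union = tp + fp + fn
    iou = tp / union if union else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return iou, precision, recall


# Toy 3x4 masks (made-up values, not data from the paper).
# The prediction covers the whole object plus one extra column.
pred = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
truth = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
]
iou, precision, recall = mask_metrics(pred, truth)
print(f"IoU={iou:.3f} precision={precision:.3f} recall={recall:.3f}")
```

In this toy case the prediction misses no foreground pixels, so recall is 1.0, while the extra column lowers both precision and IoU to 2/3. A recall that is higher than precision, as in the reported figures, is the pattern expected when predicted regions tend to extend past the true object boundary, although the example does not establish that this is the cause in the paper's results.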
