
An archaeological Catalog Collection Method Based on Large Vision-Language Models

Honglin Pang, Yi Chang, Tianjing Duan, Xi Yang

TL;DR

This work tackles the automated collection of archaeological catalogs, which are dispersed across numerous publications and contain images, descriptions, and excavation data. It introduces a three-module pipeline—Document Localization, Block Comprehension, and Block Matching—built around large vision-language models to localize blocks, extract structured attributes, and align multimodal information via foreign-key and distance-based matching. Experiments on Dabagou and Miaozigou pottery catalogs demonstrate improved accuracy over baselines and show the method’s model-agnostic robustness, with Claude 3.5 Sonnet achieving the highest reported performance among tested VLMs. The approach enables scalable, automated catalog assembly, supporting downstream tasks such as artifact classification, reconstruction, and visual question answering, and sets the stage for extending to broader artifact types and more sophisticated matching strategies.
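The two-stage alignment mentioned above (foreign-key matching, then distance-based matching) could look roughly like the following minimal sketch. It is not the authors' implementation. The `Block` fields, the `match_blocks` function, and the choice of Euclidean distance between block centres are all assumptions made for illustration.

```python
from dataclasses import dataclass
from math import hypot
from typing import Optional


@dataclass
class Block:
    """A localized page region. All field names here are hypothetical."""
    block_id: str
    kind: str                     # "image" or "text"
    center: tuple[float, float]   # (x, y) centre of the bounding box
    key: Optional[str] = None     # label read from the block, if any


def match_blocks(images: list[Block], texts: list[Block]) -> dict[str, str]:
    """Map text block ids to image block ids.

    Stage 1 pairs blocks that share the same label (foreign key).
    Stage 2 pairs each remaining text block with the nearest
    still-unmatched image block (assumed Euclidean centre distance).
    """
    matches: dict[str, str] = {}
    used: set[str] = set()

    # Stage 1: foreign-key matching on identical labels.
    by_key = {img.key: img for img in images if img.key is not None}
    for text in texts:
        img = by_key.get(text.key) if text.key is not None else None
        if img is not None and img.block_id not in used:
            matches[text.block_id] = img.block_id
            used.add(img.block_id)

    # Stage 2: distance-based fallback for unmatched text blocks.
    for text in texts:
        if text.block_id in matches:
            continue
        candidates = [img for img in images if img.block_id not in used]
        if not candidates:
            break
        nearest = min(
            candidates,
            key=lambda img: hypot(img.center[0] - text.center[0],
                                  img.center[1] - text.center[1]),
        )
        matches[text.block_id] = nearest.block_id
        used.add(nearest.block_id)

    return matches


# Toy page: one labelled pair and one unlabelled pair.
example_images = [
    Block("img1", "image", (100.0, 100.0), key="1"),
    Block("img2", "image", (400.0, 100.0)),
]
example_texts = [
    Block("txt1", "text", (120.0, 300.0), key="1"),
    Block("txt2", "text", (390.0, 310.0)),
]
example_matches = match_blocks(example_images, example_texts)
```

In this toy page, `txt1` is paired with `img1` through the shared label, and `txt2` falls back to the nearest remaining image, `img2`. The greedy fallback is only one possible design; a real pipeline might prefer a global assignment over all unmatched blocks.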

Abstract

Archaeological catalogs, which contain key elements such as artifact images, morphological descriptions, and excavation information, are essential for studying artifact evolution and cultural inheritance. These data are widely scattered across publications, so collecting them at scale requires automated methods. However, existing Large Vision-Language Models (VLMs) and the data collection methods derived from them struggle with accurate image detection and modal matching when processing archaeological catalogs, which makes automated collection difficult. To address these issues, we propose a novel archaeological catalog collection method based on Large Vision-Language Models, comprising three modules: document localization, block comprehension, and block matching. Through practical data collection from the Dabagou and Miaozigou pottery catalogs and comparison experiments, we demonstrate the effectiveness of our approach, providing a reliable solution for automated collection of archaeological catalogs.

Paper Structure (11 sections, 9 equations, 4 figures, 1 table)


Figures (4)

  • Figure 1: Target tasks for catalog datasets.
  • Figure 2: The pipeline consists of three main modules: Document Localization Module for segmenting image and text regions, Block Comprehension Module for converting visual and textual information into structured representations, and Block Matching Module for aligning multi-modal information through matching rules to achieve automated archaeological catalog collection.
  • Figure 3: Annotations of a pottery example.
  • Figure 4: Statistical analysis of the dataset: (a) shows the distribution of samples across different categories, and (b) illustrates the distribution of artifacts per unit.