Multimodal Metadata Assignment for Cultural Heritage Artifacts

Luis Rei, Dunja Mladenić, Mareike Dorozynski, Franz Rottensteiner, Thomas Schleider, Raphaël Troncy, Jorge Sebastián Lozano, Mar Gaitán Salvatella

TL;DR

This work tackles the challenge of incomplete and heterogeneous cultural heritage (CH) metadata by proposing a multimodal metadata assignment framework that fuses image, text, and tabular data via late fusion. It introduces a ResNet-152-based image classifier, a multilingual XLM-R text classifier, and a Gradient Boosted Decision Tree (GBDT) tabular classifier, trained multitask with imbalance-aware losses, and integrates their predictions into the SILKNOW knowledge graph (KG). A novel silk-fabric dataset aggregated from 12 sources demonstrates multilingual labeling, label grouping, and KG-driven data preparation, enabling evaluation across modalities. Results show the multimodal approach significantly outperforms single modalities (e.g., ~74.2% vs ~55.6% F1), with text providing the strongest single-modality signal when available and images offering broad coverage; the KG integration supports practical deployment via the ADASilk explorer. The work highlights the practical impact of multimodal metadata inference for CH and points to future gains from self-supervised learning and refined fusion strategies to handle label noise and imbalance at scale.

Abstract

We develop a multimodal classifier for the cultural heritage domain using a late fusion approach and introduce a novel dataset. The three modalities are Image, Text, and Tabular data. We based the image classifier on a ResNet convolutional neural network architecture and the text classifier on a multilingual transformer architecture (XLM-RoBERTa). Both are trained as multitask classifiers and use the focal loss to handle class imbalance. Tabular data and late fusion are handled by Gradient Tree Boosting. We also show how we leveraged specific data models and taxonomy in a Knowledge Graph to create the dataset and to store classification results. All individual classifiers accurately predict missing properties in the digitized silk artifacts, with the multimodal approach providing the best results.
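
The abstract mentions the focal loss as the remedy for class imbalance. As a minimal sketch, assuming the standard single-label form FL(p_t) = -(1 - p_t)^gamma * log(p_t) and omitting the optional class-weighting factor, the idea can be illustrated as follows. The function names, the example logits, and the default gamma = 2 are illustrative assumptions, not the authors' implementation or settings.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits, target, gamma=2.0):
    """Focal loss for one example (assumed form, no class weights).

    p_t is the predicted probability of the true class. The factor
    (1 - p_t) ** gamma shrinks the loss of well-classified examples,
    so hard examples dominate the total loss.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(logits, target):
    """Plain cross-entropy, which is the focal loss with gamma = 0."""
    return focal_loss(logits, target, gamma=0.0)


# Hypothetical logits for a 3-class task; the true class is index 0.
easy_example = [4.0, 0.0, 0.0]   # true class already scored highly
hard_example = [0.0, 1.0, 0.0]   # true class scored below a wrong class

for name, logits in [("easy", easy_example), ("hard", hard_example)]:
    ce = cross_entropy(logits, 0)
    fl = focal_loss(logits, 0)
    print(f"{name}: cross-entropy={ce:.4f} focal={fl:.4f} ratio={fl / ce:.4f}")
```

The printed ratio equals (1 - p_t)^gamma, so it is much smaller for the easy example than for the hard one. That down-weighting of easy examples is what makes the loss useful when a few frequent classes would otherwise dominate training.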

Paper Structure

This paper contains 56 sections, 6 equations, 13 figures, and 22 tables.

Figures (13)

  • Figure 1: Images in our dataset with labels
  • Figure 2: A record from the MET museum with a missing property represented in the knowledge graph using our ontology and controlled vocabularies
  • Figure 3: Network architecture of the CNN for multitask image classification
  • Figure 4: Multitask architecture of the text classifier consisting of a shared XLM-R based encoder followed by task-specific classification heads
  • Figure 5: Task-specific classification head of the text classifier consisting of a fully connected (FC) layer followed by a tanh activation followed by the output projection FC layer
  • ...and 8 more figures
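
The captions of Figures 4 and 5 describe a shared encoder followed by task-specific heads, each consisting of an FC layer, a tanh activation, and an output projection FC layer. The sketch below illustrates only that head structure in plain Python. The encoder output is replaced by a fixed vector, and the layer sizes, random initialization, and task names (`task_a`, `task_b`) are hypothetical placeholders rather than values from the paper.

```python
import math
import random


def linear(x, weights, bias):
    """Fully connected layer: one output per weight row, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def init_layer(rng, n_out, n_in):
    """Small random weights and zero biases (illustrative initialization)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def classification_head(encoding, hidden_layer, output_layer):
    """FC -> tanh -> output projection FC, returning one logit per class."""
    w_h, b_h = hidden_layer
    w_o, b_o = output_layer
    hidden = [math.tanh(h) for h in linear(encoding, w_h, b_h)]
    return linear(hidden, w_o, b_o)


rng = random.Random(0)
encoding_dim, hidden_dim = 4, 3          # hypothetical sizes
num_classes = {"task_a": 3, "task_b": 5}  # hypothetical tasks

# One head per task; all heads read the same shared encoding.
heads = {
    task: (init_layer(rng, hidden_dim, encoding_dim),
           init_layer(rng, n, hidden_dim))
    for task, n in num_classes.items()
}

shared_encoding = [0.5, -1.0, 0.25, 2.0]  # stand-in for the encoder output
logits = {task: classification_head(shared_encoding, h, o)
          for task, (h, o) in heads.items()}

for task, values in logits.items():
    print(task, [round(v, 4) for v in values])
```

Because every head consumes the same encoding, gradients from all tasks would update the shared encoder during training, while each head's parameters stay specific to its own task.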