Multimodal Metadata Assignment for Cultural Heritage Artifacts
Luis Rei, Dunja Mladenić, Mareike Dorozynski, Franz Rottensteiner, Thomas Schleider, Raphaël Troncy, Jorge Sebastián Lozano, Mar Gaitán Salvatella
TL;DR
This work tackles the challenge of incomplete and heterogeneous cultural heritage metadata by proposing a multimodal metadata assignment framework that fuses image, text, and tabular data via late fusion. It introduces a ResNet-152-based image classifier, a multilingual XLM-R text classifier, and a Gradient Boosted Decision Tree (GBDT) tabular classifier, all trained multitask with imbalance-aware losses, and integrates predictions into a SILKNOW knowledge graph. A novel silk-fabric dataset aggregated from 12 sources demonstrates multilingual labeling, label grouping, and KG-driven data preparation, enabling robust evaluation across modalities. Results show the multimodal approach significantly outperforms single modalities (e.g., ~74.2% vs ~55.6% F1), with text providing the strongest single-modality signal when available and images offering broad coverage; the KG integration supports practical deployment via the ADASilk explorer. The work highlights the practical impact of multimodal metadata inference for cultural heritage (CH) and points to future gains from self-supervised learning and refined fusion strategies to handle label noise and imbalance at scale.
Abstract
We develop a multimodal classifier for the cultural heritage domain using a late fusion approach and introduce a novel dataset. The three modalities are Image, Text, and Tabular data. We base the image classifier on a ResNet convolutional neural network architecture and the text classifier on a multilingual transformer architecture (XLM-RoBERTa). Both are trained as multitask classifiers and use the focal loss to handle class imbalance. Tabular data and late fusion are handled by Gradient Tree Boosting. We also show how we leveraged specific data models and a taxonomy in a Knowledge Graph to create the dataset and to store classification results. All individual classifiers accurately predict missing properties in the digitized silk artifacts, with the multimodal approach providing the best results.
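As a rough illustration of why the focal loss mentioned above helps with class imbalance, the following minimal sketch compares it with plain cross-entropy on a single example. It follows the standard focal loss definition, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t); the function names, the default `gamma=2.0` and `alpha=1.0`, and the toy probabilities are assumptions for illustration, not the settings used in this work.

```python
import math

def cross_entropy(probs, target):
    """Standard cross-entropy for one example; probs is a list of class probabilities."""
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -math.log(p_t)

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example (hypothetical helper, standard formulation).

    The factor (1 - p_t) ** gamma shrinks the loss of well-classified
    examples, so confidently predicted majority-class samples contribute
    less to training than hard, often minority-class, samples.
    """
    p_t = max(probs[target], 1e-12)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Toy predictions (assumed values): one easy example, one hard example.
easy = [0.9, 0.1]   # true class 0 predicted with high confidence
hard = [0.3, 0.7]   # true class 0 predicted with low confidence

easy_ce, hard_ce = cross_entropy(easy, 0), cross_entropy(hard, 0)
easy_fl, hard_fl = focal_loss(easy, 0), focal_loss(hard, 0)
```

With `gamma=0` the focal loss reduces to cross-entropy; larger values of `gamma` down-weight the easy example more strongly relative to the hard one.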
