Diffusion Based Augmentation for Captioning and Retrieval in Cultural Heritage

Dario Cioni, Lorenzo Berlincioni, Federico Becattini, Alberto del Bimbo

TL;DR

The paper tackles data scarcity and domain shift in cultural heritage with a semantic augmentation workflow: a caption-conditioned Latent Diffusion Model (Stable Diffusion) generates multiple content-preserving visual variations of each artwork, yielding a synthetic dataset of size $N\times M$, and models are trained with a balancing factor $\alpha=0.5$. The approach is evaluated on Artpedia and ArtCap for image captioning (with GIT-base and BLIP-base) and for cross-domain retrieval (with CLIP), where it improves standard captioning metrics and retrieval recall over strong baselines. The study shows that diffusion-based augmentation can better align visual content with domain-specific jargon, improving the grounding of artistic knowledge and the semantic richness of captions. For cultural heritage applications, this translates into more effective and accessible interaction with artworks through better captioning and retrieval.
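As a rough illustration of how a balancing factor such as $\alpha$ could mix real and synthetic data, the sketch below enumerates the $N\times M$ synthetic pairs and draws training samples. It assumes that $\alpha$ is the probability of replacing a real image with one of its synthetic variations; the summary does not define $\alpha$, so this reading, along with the names `Sample`, `build_synthetic_index`, and `draw_training_sample`, is hypothetical and not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sample:
    artwork_id: int
    variation: Optional[int]  # None means the original (real) image
    caption: str


def build_synthetic_index(captions: List[str], m: int) -> List[Sample]:
    """Enumerate the N x M synthetic (artwork, variation) pairs."""
    return [Sample(i, j, cap) for i, cap in enumerate(captions) for j in range(m)]


def draw_training_sample(captions: List[str], m: int, alpha: float = 0.5,
                         rng=random) -> Sample:
    """Pick an artwork uniformly at random.

    ASSUMPTION: with probability alpha, return one of its M synthetic
    variations; otherwise return the real image. The caption is shared.
    """
    i = rng.randrange(len(captions))
    if rng.random() < alpha:
        return Sample(i, rng.randrange(m), captions[i])
    return Sample(i, None, captions[i])
```

Under this reading, $\alpha=0.5$ means roughly half of the images seen during training are synthetic, while every sample keeps the caption of its source artwork.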

Abstract

Cultural heritage applications and advanced machine learning models are creating a fruitful synergy to provide effective and accessible ways of interacting with artworks. Smart audio-guides, personalized art-related content and gamification approaches are just a few examples of how technology can be exploited to provide additional value to artists or exhibitions. Nonetheless, from a machine learning point of view, the amount of available artistic data is often not enough to train effective models. Off-the-shelf computer vision modules can still be exploited to some extent, yet a severe domain shift is present between art images and standard natural image datasets used to train such models. As a result, this can lead to degraded performance. This paper introduces a novel approach to address the challenges of limited annotated data and domain shifts in the cultural heritage domain. By leveraging generative vision-language models, we augment art datasets by generating diverse variations of artworks conditioned on their captions. This augmentation strategy enhances dataset diversity, bridging the gap between natural images and artworks, and improving the alignment of visual cues with knowledge from general-purpose datasets. The generated variations assist in training vision and language models with a deeper understanding of artistic characteristics and that are able to generate better captions with appropriate jargon.

Paper Structure (15 sections, 7 figures, 4 tables)

This paper contains 15 sections, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Schematic of the data augmentation pipeline. In its Image&Text$\longrightarrow$Image configuration, the conditional generative model accepts both an image and a text input. We provide the model with the original artwork and its detailed textual analysis from Artpedia [stefanini2019artpedia], and use the diffusion model's outputs as new datapoints for training other models on downstream tasks.
  • Figure 2: Distribution of caption lengths in the Artpedia [stefanini2019artpedia] and ArtCap [artcap] datasets.
  • Figure 3: Samples of images along with their textual descriptions from Artpedia (top) and ArtCap (bottom) datasets.
  • Figure 4: Samples of the augmented images. Left: original image and its caption. Right: multiple augmented samples generated from the combination of the provided description and the original input image.
  • Figure 5: Average cosine similarity between CLIP embeddings of: (a) real images and the associated captions; (b) synthetic images and the associated captions; (c) real images and their synthetic variations.
  • ...and 2 more figures
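
The quantity plotted in Figure 5, an average cosine similarity over paired embeddings, can be sketched as follows. This is a minimal illustration that assumes the CLIP embeddings have already been computed and are available as plain lists of floats; the helper names `cosine_similarity` and `average_pairwise_similarity` are hypothetical and do not come from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def average_pairwise_similarity(xs: Sequence[Sequence[float]],
                                ys: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over aligned pairs (xs[i], ys[i])."""
    if len(xs) != len(ys) or len(xs) == 0:
        raise ValueError("expected two non-empty sequences of equal length")
    return sum(cosine_similarity(x, y) for x, y in zip(xs, ys)) / len(xs)
```

The three settings in the figure would correspond to different choices of the paired inputs: real images with their captions, synthetic images with their captions, and real images with their synthetic variations.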