ArtContext: Contextualizing Artworks with Open-Access Art History Articles and Wikidata Knowledge through a LoRA-Tuned CLIP Model

Samuel Waugh, Stuart James

TL;DR

ArtContext tackles the challenge of grounding paintings in scholarly prose by linking artworks to sentences from open-access art-history articles and Wikidata metadata. It constructs a large-scale, weakly supervised training pipeline that ingests 27,044 open-access articles across 450 artists from OpenAlex, extracts candidate contexts with Sentence-BERT, and aligns them to paintings using Wikidata-informed queries to produce 29,697 image–text pairs. These pairs train PaintingCLIP, a LoRA-adapted version of CLIP for domain-specific grounding, achieving improved retrieval performance over vanilla CLIP while preserving zero-shot capabilities. The approach demonstrates that weak, scalable supervision from scholarly text can adapt vision–language models for nuanced humanities tasks and is readily generalizable to other domains with rich metadata and textual corpora.
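The sentence-selection step can be sketched as follows. This is a minimal illustration, not the authors' implementation: a toy bag-of-words vector stands in for the Sentence-BERT embedding, the metadata keys (`title`, `creator`, `depicts`) are hypothetical, and the candidate sentences are invented for the example.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; an assumed stand-in for a Sentence-BERT vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_query(metadata):
    """Join hypothetical Wikidata-style fields into one query string."""
    fields = ("title", "creator", "depicts")
    return " ".join(str(metadata[k]) for k in fields if k in metadata)


def best_sentence(metadata, sentences):
    """Return (score, sentence) for the candidate most similar to the query."""
    query_vec = embed(build_query(metadata))
    return max((cosine(query_vec, embed(s)), s) for s in sentences)


# Illustrative inputs only; none of this is data from the paper.
painting = {
    "title": "Bacchus and Ariadne",
    "creator": "Titian",
    "depicts": "Bacchus Ariadne chariot",
}
candidates = [
    "Titian shows Bacchus leaping from his chariot towards Ariadne.",
    "The archive was reorganised in the nineteenth century.",
    "Venetian workshops often reused cartoons.",
]
score, sentence = best_sentence(painting, candidates)
print(f"{score:.3f}  {sentence}")
```

In a fuller pipeline, `embed` would be replaced by a real sentence encoder, and the selected sentence would be paired with the painting's image to form one weakly supervised training example.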

Abstract

Many Art History articles discuss artworks in general as well as specific aspects of works, such as layout, iconography, or material culture. However, when viewing an artwork, it is not trivial to identify what different articles have said about the piece. We therefore propose ArtContext, a pipeline that takes a corpus of Open-Access Art History articles and Wikidata Knowledge and annotates artworks with this information. We do this using a novel corpus collection pipeline, and then train a bespoke CLIP model adapted with Low-Rank Adaptation (LoRA) to make it domain-specific. We show that the new model, PaintingCLIP, which is weakly supervised by the collected corpus, outperforms CLIP and provides context for a given artwork. The proposed pipeline is generalisable and can be readily applied to numerous humanities areas.
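To make the LoRA idea concrete, the sketch below shows the standard low-rank update applied to a single frozen weight matrix: the adapted layer computes W x + (alpha / r) B (A x), and only A and B are trained. The matrix values, the scaling factor `alpha`, and the rank `r` are illustrative assumptions; this section does not state the rank or the layers adapted in PaintingCLIP.

```python
def matvec(M, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def lora_forward(x, W, A, B, alpha, r):
    """Compute W x + (alpha / r) * B (A x).

    W (d x k) stays frozen; only A (r x k) and B (d x r) would be trained.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d, k, r):
    """Parameter count of the low-rank pair (B, A) and of the full d x k matrix."""
    return r * (d + k), d * k


# Tiny example with d = 2, k = 3, r = 1 (all values are illustrative).
W = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
A = [[0.5, 0.5, 0.5]]
B_zero = [[0.0], [0.0]]    # common initialisation: B = 0, so the update starts at zero
B_tuned = [[1.0], [-1.0]]  # hypothetical values after some training
x = [1.0, 2.0, 3.0]

print(lora_forward(x, W, A, B_zero, alpha=2, r=1))   # identical to the frozen layer
print(lora_forward(x, W, A, B_tuned, alpha=2, r=1))  # frozen output plus low-rank shift
print(trainable_params(d=768, k=768, r=8))           # hypothetical sizes and rank
```

Because B starts at zero, the adapted model initially reproduces the base model exactly, and the low-rank pair adds far fewer trainable parameters than the full matrix, which is consistent with adapting CLIP to a new domain from a comparatively small set of image–text pairs.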

Paper Structure (13 sections, 8 equations, 4 figures)

Figures (4)

  • Figure 1: Bacchus and Ariadne (Titian, 1523). CLIP-based saliency visualization: a heat-map overlay produced by back-propagating the image–text similarity for the sentence; warmer regions mark the areas with the most influence on the model’s score.
  • Figure 2: Overview of $\mathcal{A}$rtContext. Given a set of artists $\mathcal{A}$: (a) Open-access art-historical articles are harvested via OpenAlex and converted into structured text, from which candidate sentence contexts are extracted and embedded using Sentence-BERT. (b) For each painting, Wikidata metadata is used to construct a semantic query that is matched against the candidate sentences to select the most relevant description. (c) The resulting image–text pairs supervise Low-Rank Adaptation (LoRA) fine-tuning of CLIP ViT-B/32, producing PaintingCLIP with improved alignment between paintings and art-historical articles.
  • Figure 3: Example sentences extracted for The Night Watch (Rembrandt van Rijn, 1642) using the PaintingCLIP model and the collected corpus.
  • Figure 4: Comparison of averaged $\bar{P}(R)$ curves over a set of queries for CLIP vs. PaintingCLIP.
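
An averaged precision–recall curve like the one in Figure 4 can be computed along the following lines. The exact definition of $\bar{P}(R)$ is not given in this section, so the sketch assumes the common convention of interpolated precision at fixed recall levels, averaged over queries; the relevance lists are toy data, not results from the paper.

```python
def interpolated_precision(ranked_relevance, n_relevant, levels):
    """Interpolated precision at each recall level for one ranked result list.

    ranked_relevance: booleans, True where the retrieved item is relevant.
    Interpolated precision at recall R is the best precision at any recall >= R.
    """
    points = []  # (recall, precision) after each retrieved item
    hits = 0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        hits += bool(relevant)
        points.append((hits / n_relevant, hits / rank))
    return [
        max((p for r, p in points if r >= level), default=0.0)
        for level in levels
    ]


def averaged_pr_curve(runs, levels):
    """Average the per-query interpolated precision at each recall level."""
    curves = [interpolated_precision(rel, n, levels) for rel, n in runs]
    return [sum(c[i] for c in curves) / len(curves) for i in range(len(levels))]


levels = [0.0, 0.5, 1.0]
# Two toy queries per model: (relevance of ranked results, number of relevant items).
baseline = [([False, True, False, True], 2), ([True, False, False, True], 2)]
adapted = [([True, True, False, False], 2), ([True, False, True, False], 2)]

baseline_curve = averaged_pr_curve(baseline, levels)
adapted_curve = averaged_pr_curve(adapted, levels)
print(baseline_curve)
print(adapted_curve)
```

A model whose curve lies above the other at every recall level ranks relevant items earlier on average, which is the kind of comparison the figure draws between CLIP and PaintingCLIP.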