Graph Neural Networks for Knowledge Enhanced Visual Representation of Paintings

Athanasios Efthymiou, Stevan Rudinac, Monika Kackovic, Marcel Worring, Nachoem Wijnberg

TL;DR

ArtSAGENet introduces a knowledge-enhanced, multimodal framework that fuses CNN-based visual features with GNN-modeled semantic relationships among paintings to improve fine art analysis. By employing scalable graph neural networks and multi-task learning, the approach achieves state-of-the-art results on style classification, artist attribution, and creation-year estimation on WikiArt variants, while maintaining data efficiency and reduced training time. The method demonstrates both quantitative gains and qualitative improvements in retrieval, underscoring the value of integrating visual content with semantic context for art analysis and curation. This work paves the way for knowledge-aware art understanding and efficient curation in large-scale art collections.

Abstract

We propose ArtSAGENet, a novel multimodal architecture that integrates Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), to jointly learn visual and semantic-based artistic representations. First, we illustrate the significant advantages of multi-task learning for fine art analysis and argue that it is conceptually a much more appropriate setting in the fine art domain than the single-task alternatives. We further demonstrate that several GNN architectures can outperform strong CNN baselines in a range of fine art analysis tasks, such as style classification, artist attribution, creation period estimation, and tag prediction, while training them requires an order of magnitude less computational time and only a small amount of labeled data. Finally, through extensive experimentation we show that our proposed ArtSAGENet captures and encodes valuable relational dependencies between the artists and the artworks, surpassing the performance of traditional methods that rely solely on the analysis of visual content. Our findings underline the great potential of integrating visual content and semantics for fine art analysis and curation.

Paper Structure

This paper contains 24 sections, 8 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: An illustration of our graph structured fine art analysis pipeline. First we extract visual features for node representations. Then, we utilize graph structure to learn context-aware fine art representations. Each node denotes a painting. Edges are drawn based on painting properties.
  • Figure 2: The ArtSAGENet architecture. Given a batch of images, forward propagation proceeds as follows. GraphSAGE: (a) for each image in the batch we sample $k$ neighbors, e.g., $k=3$, from $h$ hops, e.g., $h=2$, (b) aggregate the node feature vectors within the neighborhood, and (c) obtain the final node representation. ResNet: (d) each image of the batch passes through the frozen part of the network pre-trained on ImageNet, and then (e) through the network's last bottleneck, which is trainable and fine-tuned, to obtain the final visual representation. Finally, the obtained representations are merged and the resulting multimodal representation is passed to the last layer (classifier or regressor). GraphSAGE (a-c) and ResNet-152 (e) are jointly trained for all three tasks in an MTL manner.
  • Figure 3: Qualitative analysis of learned visual representations for painting retrieval. Given a reference painting (middle), the top-5 nearest neighbors were retrieved using ResNet-152 (left) and ArtSAGENet (right). Misaligned patches denote paintings whose style, artist, or timeline annotation(s) differ from those of the reference painting. The MTL row shows the top-5 nearest neighbors retrieved using the multi-task learning model trained for style classification, artist attribution, and timeframe estimation. The remaining rows show the top-5 nearest neighbors retrieved using the single-task classifiers. For the single-task tag prediction classifier, bold indicates that the tag is also attributed to the query painting. $\spadesuit$ denotes using visual features as node feature vectors.
  • Figure 4: Qualitative analysis of single and multi-task ArtSAGENet learned representations for painting retrieval.
  • Figure 5: Stylistic Movements distribution over time on WikiArt dataset.
  • ...and 3 more figures
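
The GraphSAGE steps in the Figure 2 caption (sample $k$ neighbors over $h$ hops, aggregate, then merge with the visual representation) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes a mean aggregator with no learned weights, so the vector size doubles at every hop, and it assumes that merging means concatenation, which the caption does not specify. The painting ids, edges, feature values, and the names `sample_neighbors`, `sage_embed`, and `visual_vec` are all hypothetical.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Step (a): sample k neighbors of `node` from the adjacency dict."""
    neighbors = adj.get(node, [])
    if not neighbors:
        return [node] * k  # isolated node: fall back to the node itself
    if len(neighbors) >= k:
        return rng.sample(neighbors, k)  # without replacement
    return [rng.choice(neighbors) for _ in range(k)]  # with replacement


def mean(vectors):
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def sage_embed(adj, feats, node, k, hops, rng):
    """Steps (b)-(c): recursive mean aggregation over `hops` hops.

    Assumption: no learned projection is applied, so the output is the
    node's own vector concatenated with the mean of its sampled neighbors.
    """
    if hops == 0:
        return list(feats[node])
    self_vec = sage_embed(adj, feats, node, k, hops - 1, rng)
    neigh_vecs = [
        sage_embed(adj, feats, n, k, hops - 1, rng)
        for n in sample_neighbors(adj, node, k, rng)
    ]
    return self_vec + mean(neigh_vecs)


# Hypothetical toy graph: nodes are paintings, edges link paintings that
# share a property (for example, the same artist).
adj = {"p0": ["p1", "p2"], "p1": ["p0"], "p2": ["p0", "p3"], "p3": ["p2"]}
feats = {"p0": [0.1, 0.9], "p1": [0.2, 0.8], "p2": [0.7, 0.3], "p3": [0.6, 0.4]}

rng = random.Random(0)
graph_vec = sage_embed(adj, feats, "p0", k=3, hops=2, rng=rng)

# Hypothetical stand-in for the ResNet output of the same painting.
visual_vec = [0.5, 0.5, 0.5]

# Assumed merge: concatenate the two representations before the final layer.
fused = visual_vec + graph_vec
```

In a trained model, each hop would apply a learned projection and nonlinearity after aggregation, which keeps the representation at a fixed size. The sketch leaves that out so the sampling and aggregation logic stays visible.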