Context-Infused Visual Grounding for Art

Selina Khan, Nanne van Noord

TL;DR

This paper presents CIGAr (Context-Infused GroundingDINO for Art), a visual grounding approach which utilises the artwork descriptions during training as context, thereby enabling visual grounding on art.

Abstract

Many artwork collections contain textual attributes that provide rich and contextualised descriptions of artworks. Visual grounding offers the potential for localising the subjects of these descriptions in images; however, existing approaches are trained on natural images and generalise poorly to art. In this paper, we present CIGAr (Context-Infused GroundingDINO for Art), a visual grounding approach which utilises the artwork descriptions during training as context, thereby enabling visual grounding on art. In addition, we present a new dataset, Ukiyo-eVG, with manually annotated phrase-grounding annotations, and we set a new state-of-the-art for object detection on two artwork datasets.

Paper Structure

This paper contains 20 sections, 4 figures, 7 tables.

Figures (4)

  • Figure 1: Example entries from the Ukiyo-eVG dataset.
  • Figure 2: Text token distribution of the input prompt on two identical box proposals. Box 1 (top) predicts the multi-token phrase 'two women a boy' because the phrase groups 'two women' and 'a boy' both exceed the base threshold of $0.20$. Box 2 (bottom) shows a correct prediction referring to 'a boy'.
  • Figure 3: Overview of CIGAr. Components marked in red are additions to the GroundingDINO model architecture. Phrase and caption embeddings are extracted from a fine-tuned BERT encoder and then fused to provide context-rich phrase embeddings. The fused embeddings are passed through a cross-modality transformer alongside the image features before the predicted phrases and object boxes are matched.
  • Figure 4: Examples of CIGAr predictions on Ukiyo-eVG data, comparing the zero-shot GroundingDINO (GD) output (left) with the CIGAr output (middle) and the ground truth (right).
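
The thresholding behaviour described in the Figure 2 caption can be illustrated with a minimal Python sketch. It assumes that a box's predicted phrase is formed by joining every prompt token whose score for that box exceeds the threshold. The function name `phrase_from_token_scores` and all score values are hypothetical and chosen only to reproduce the two cases in the caption; they are not taken from the paper or from the GroundingDINO code base.

```python
from typing import List


def phrase_from_token_scores(
    tokens: List[str], scores: List[float], threshold: float = 0.20
) -> str:
    """Join every prompt token whose score for one box exceeds the threshold.

    Hypothetical helper for illustration; 0.20 is the base threshold
    named in the Figure 2 caption.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    return " ".join(tok for tok, s in zip(tokens, scores) if s > threshold)


# Prompt tokens for the two phrase groups 'two women' and 'a boy'.
tokens = ["two", "women", "a", "boy"]

# Made-up per-token scores for two box proposals (illustrative only).
box1_scores = [0.31, 0.35, 0.24, 0.42]  # both phrase groups exceed 0.20
box2_scores = [0.05, 0.08, 0.27, 0.51]  # only 'a boy' exceeds 0.20

print(phrase_from_token_scores(tokens, box1_scores))  # two women a boy
print(phrase_from_token_scores(tokens, box2_scores))  # a boy
```

Under these assumed scores, the first box merges both phrase groups into a single multi-token phrase, while the second box yields only 'a boy', mirroring the contrast the caption draws between Box 1 and Box 2.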