No Annotations for Object Detection in Art through Stable Diffusion

Patrick Ramos, Nicolas Gonthier, Selina Khan, Yuta Nakashima, Noa Garcia

TL;DR

This work tackles object detection in art under limited supervision by introducing NADA, a two-module pipeline that pairs a class proposer (weakly-supervised or zero-shot) with a diffusion-based class-conditioned detector. The detector inverts and reconstructs the painting with Stable Diffusion, collects the resulting cross-attention maps, and converts them into bounding boxes through watershed segmentation, without fine-tuning any pretrained component. With its weakly-supervised (WSCP) and zero-shot (ZSCP) class proposers, NADA achieves state-of-the-art weakly-supervised results on ArtDL 2.0 and IconArt and presents the first zero-shot results in the art domain. Ablation studies show that class proposer quality drives performance and that the prompting strategy also influences results. Successful localization on WikiArt images suggests the approach applies broadly to art imagery while reducing the annotation burden.

Abstract

Object detection in art is a valuable tool for the digital humanities, as it allows for faster identification of objects in artistic and historical images compared to humans. However, annotating such images poses significant challenges due to the need for specialized domain expertise. We present NADA (no annotations for detection in art), a pipeline that leverages diffusion models' art-related knowledge for object detection in paintings without the need for full bounding box supervision. Our method, which supports both weakly-supervised and zero-shot scenarios and does not require any fine-tuning of its pretrained components, consists of a class proposer based on large vision-language models and a class-conditioned detector based on Stable Diffusion. NADA is evaluated on two artwork datasets, ArtDL 2.0 and IconArt, outperforming prior work in weakly-supervised detection, while being the first work for zero-shot object detection in art. Code is available at https://github.com/patrick-john-ramos/nada

Paper Structure

This paper contains 43 sections, 2 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Art object detection in the wild with NADA's class-conditioned detector.
  • Figure 2: NADA consists of predicting classes from a painting with a class proposer and extracting bounding boxes for the predicted classes with a class-conditioned detector. The class proposer can operate in a weakly-supervised or a zero-shot setting. The class-conditioned detector leverages Stable Diffusion to extract bounding boxes by inverting and regenerating the painting conditioned on an input prompt. The cross-attention maps from the predicted class are aggregated and processed with watershed segmentation to find the bounding box.
  • Figure 3: Bounding box extraction from attention maps.
  • Figure 4: ArtDL 2.0 and IconArt test images overlaid with NADA (with ZSCP) attention maps and bounding boxes, shown in pairs. Redder areas indicate higher attention while bluer areas indicate lower attention. Correct predictions are in green and incorrect predictions are in red. Where a predicted box has $< 0.5$ IoU with the ground truth, the ground-truth box is shown in yellow.
  • Figure 5: Mask prior to bounding box drawing for different thresholds, including Otsu's method.
  • ...and 1 more figure
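
The captions above mention three steps that a small example may make more concrete: thresholding an aggregated attention map (Figure 5 names Otsu's method), drawing a box from the resulting mask (Figure 3), and comparing that box to the ground truth at 0.5 IoU (Figure 4). The sketch below is illustrative only and is not taken from the NADA code. It deliberately swaps in plain Otsu thresholding with a single enclosing box in place of the watershed segmentation the paper describes, and the function names (`otsu_threshold`, `attention_to_box`, `iou`) and the `(x_min, y_min, x_max, y_max)` box convention with exclusive upper bounds are assumptions made here.

```python
import numpy as np


def otsu_threshold(values, bins=256):
    """Pick the threshold that maximizes between-class variance (Otsu's method)."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2
    weight_lo = np.cumsum(hist)              # pixel count at or below each bin
    weight_hi = weight_lo[-1] - weight_lo    # pixel count above each bin
    valid = (weight_lo > 0) & (weight_hi > 0)
    if not valid.any():
        # Constant map: no split exists, so return a threshold nothing exceeds.
        return float(values.max())
    cum_sum = np.cumsum(hist * centers)
    mean_lo = cum_sum[valid] / weight_lo[valid]
    mean_hi = (cum_sum[-1] - cum_sum[valid]) / weight_hi[valid]
    between = weight_lo[valid] * weight_hi[valid] * (mean_lo - mean_hi) ** 2
    return float(centers[valid][np.argmax(between)])


def attention_to_box(attn):
    """Threshold a 2D attention map and return one enclosing box, or None.

    Assumption: a single box around all above-threshold pixels. The paper's
    detector uses watershed segmentation instead of this simplification.
    """
    mask = attn > otsu_threshold(attn)
    if not mask.any():
        return None
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    # (x_min, y_min, x_max, y_max), upper bounds exclusive
    return (int(cols[0]), int(rows[0]), int(cols[-1]) + 1, int(rows[-1]) + 1)


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Toy attention map: one bright rectangle on a dark background.
attn = np.zeros((8, 8))
attn[2:5, 3:7] = 1.0
pred = attention_to_box(attn)
is_correct = pred is not None and iou(pred, (3, 2, 7, 5)) >= 0.5
```

In this toy case the predicted box coincides with the bright rectangle, so it counts as correct under the 0.5 IoU criterion. Real attention maps are smooth and may contain several peaks, which is where a region-based method such as watershed becomes relevant.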