fruit-SALAD: A Style Aligned Artwork Dataset to reveal similarity perception in image embeddings

Tillmann Ohm, Andres Karjus, Mikhail Tamm, Maximilian Schich

TL;DR

Style Aligned Artwork Datasets (SALADs) are introduced, together with fruit-SALAD, an example dataset of 10,000 images of fruit depictions. The benchmark shows salient differences in how various computational models, including machine learning models, feature extraction algorithms, and complexity measures, weight semantic category similarity against style similarity.

Abstract

The notion of visual similarity is essential for computer vision and for applications and studies that revolve around vector embeddings of images. However, the scarcity of benchmark datasets poses a significant hurdle in exploring how the underlying models perceive similarity. Here we introduce Style Aligned Artwork Datasets (SALADs), and an example of fruit-SALAD with 10,000 images of fruit depictions. This combined semantic category and style benchmark comprises 100 instances each of 10 easy-to-recognize fruit categories, across 10 easily distinguishable styles. Leveraging a systematic pipeline of generative image synthesis, this visually diverse yet balanced benchmark demonstrates salient differences in semantic category and style similarity weights across various computational models, including machine learning models, feature extraction algorithms, and complexity measures, as well as conceptual models for reference. This meticulously designed dataset offers a controlled and balanced platform for the comparative analysis of similarity perception. The SALAD framework moves the comparison of how these models perform semantic category and style recognition tasks beyond the level of anecdotal knowledge, making it robustly quantifiable and qualitatively interpretable.

Paper Structure (11 sections, 7 figures, 2 tables)

This paper contains 11 sections, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Overview of one instance of each of the 10 fruit categories in each of the 10 styles. The columns display the fruit categories, from left to right: blueberries, fig, strawberry, apple, orange, pineapple, bananas, pear, avocado, kiwi. The rows display the styles, from top to bottom, with short labels that approximate the style prompts: 'Crayon', 'Watercolor', 'Comic', 'Pixel', 'Patch', 'Cubic', 'Oilio', 'Glamour', 'Lomo', 'Pile'. The full dataset contains 100 instances for each fruit category-style combination.
  • Figure 2: Overview of the image generation process. (a) Image generation pipeline. 1. Style reference image generation with Stable Diffusion XL (Podell et al., 2023) by manual trial and error, using text prompts that combine a style description with ‘an apple’. 2. Style aligned image generation (Hertz et al., 2024) based on each style reference image, using diffusion inversion and text prompts that iterate over the 10 fruit categories and generate 100 instances each, resulting in 10,000 images. (b) Examples of tolerated and rejected results. Left: tolerated minor issues, which do not impact recognition of category or style; right: rejected major issues, where the result is either unrecognizable or inconsistent with the style.
  • Figure 3: Self-recognition tests. Each cell covers one fruit category (column) and style (row) combination and shows the mean number of images from that same combination among the top 100 nearest neighbors of its images. White cells without values have a perfect score of 100 out of 100 correctly recognized instances. Left: maximum values across all computational models; note that each combination accounts for only 100 of the 10,000 images, so high scores reflect better-than-chance results. Right: ResNet50_IN21k as an example model.
  • Figure 4: Heatmap for DINO-ViT-B-16_IN1k indicating the mutual Mahalanobis distances of fruit-SALAD images. Each matrix cell is the mean of all 10,000 pairwise distances between the 100 instances of one category-style combination and the 100 instances of another. Below the diagonal: sorted by style first and fruit category second. Above the diagonal: sorted by fruit category first and style second. The color indicates the pairwise Mahalanobis distance of image embedding vectors obtained from the respective model or algorithm, from low to high (blue to yellow); low values indicate higher similarity. Because the matrices are symmetric, showing one ordering in each triangle loses no information; diagonal cells can be left out.
  • Figure 5: Heatmaps indicating the mutual Mahalanobis distance of fruit-SALAD images according to different models (see Fig. 4). Top row from left to right: CLIP-ViT-B-16_L400M, DINOv2-B_LVD, CompressionEnsembles. Bottom row from left to right: VGG19_IN1k, ViT-B-32_IN21, style_blind. The matrix ordering is identical.
  • ...and 2 more figures
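
The two measurements described in the captions of Figures 3 and 4 can be illustrated with a small NumPy sketch. This is not the authors' code. The toy embeddings, all function names, and the choice to estimate the covariance from the full set of embeddings are assumptions made for illustration; the captions do not say how the covariance was obtained or which distance the self-recognition test uses (Mahalanobis is reused here). The toy data uses 2 styles, 3 categories, and 5 instances per combination instead of 10, 10, and 100.

```python
import numpy as np

def make_toy_embeddings(n_style=2, n_cat=3, n_inst=5, noise=0.01, seed=0):
    """Hypothetical stand-in for model embeddings: 2-D points whose first
    axis encodes style and whose second axis encodes fruit category."""
    rng = np.random.default_rng(seed)
    style = np.repeat(np.arange(n_style), n_cat * n_inst)
    category = np.tile(np.repeat(np.arange(n_cat), n_inst), n_style)
    centers = np.stack([10.0 * style, 10.0 * category], axis=1)
    X = centers + rng.normal(scale=noise, size=centers.shape)
    return X, style, category

def mahalanobis_matrix(X):
    """All pairwise Mahalanobis distances between rows of X.
    Assumption: the covariance is estimated from X itself."""
    VI = np.linalg.pinv(np.cov(X, rowvar=False))   # inverse covariance
    diff = X[:, None, :] - X[None, :, :]           # O(n^2 * d) memory: toy sizes only
    d2 = np.einsum("ijk,kl,ijl->ij", diff, VI, diff)
    return np.sqrt(np.maximum(d2, 0.0))            # clip tiny negative rounding errors

def self_recognition(D, cell, k):
    """Per combination: mean number of same-combination images among the
    k nearest neighbors of each image (the query image counts as a neighbor)."""
    nn = np.argsort(D, axis=1, kind="stable")[:, :k]
    hits = (cell[nn] == cell[:, None]).sum(axis=1)
    return np.array([hits[cell == c].mean() for c in range(cell.max() + 1)])

def cell_mean_distances(D, cell):
    """Mean pairwise distance between every pair of combinations."""
    n_cells = cell.max() + 1
    M = np.zeros((n_cells, n_cells))
    for a in range(n_cells):
        for b in range(n_cells):
            M[a, b] = D[np.ix_(cell == a, cell == b)].mean()
    return M

X, style, category = make_toy_embeddings()
cell = style * (category.max() + 1) + category     # one id per style-category combination
D = mahalanobis_matrix(X)
scores = self_recognition(D, cell, k=5)            # k = instances per combination
M = cell_mean_distances(D, cell)                   # ordered by style first, category second
print(scores)
print(M.round(2))
```

With k equal to the number of instances per combination, a score of k corresponds to the perfect cells in Figure 3. Ordering the combination ids by category first instead of style first gives the other triangle of the Figure 4 layout. For the full 10,000 images the broadcasted difference tensor would be too large; a chunked computation or `scipy.spatial.distance.cdist` with `metric="mahalanobis"` would be a more practical choice.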