Do I look like a `cat.n.01` to you? A Taxonomy Image Generation Benchmark

Viktor Moskvoretskii, Alina Lobanova, Ekaterina Neminova, Chris Biemann, Alexander Panchenko, Irina Nikishina

TL;DR

This work investigates the feasibility of zero-shot taxonomy image generation by text-to-image models and introduces the Taxonomy Image Generation benchmark, featuring nine theory-grounded metrics and nine practical evaluations. It assembles three datasets (Easy Concepts, Random WordNet splits, and LLM-predicted concepts) and evaluates 12 open-source TTI models, including a retrieval baseline, using both human and GPT-4 pairwise preferences encoded via Elo-style rankings and a reward model. Key findings show that model rankings diverge from standard TTI benchmarks, with Playground-v2 and FLUX often leading across metrics, while retrieval lags behind; prompting with definitions generally enhances performance. The study demonstrates potential for automating structured data curation and WordNet extension through generated imagery, and it releases datasets and images for broader use while acknowledging limitations such as focus on open-source models and potential biases in AI judges.

Abstract

This paper explores the feasibility of using text-to-image models in a zero-shot setup to generate images for taxonomy concepts. While text-based methods for taxonomy enrichment are well established, the potential of the visual dimension remains unexplored. To address this, we propose a comprehensive benchmark for Taxonomy Image Generation that assesses models' abilities to understand taxonomy concepts and generate relevant, high-quality images. The benchmark includes common-sense and randomly sampled WordNet concepts, alongside LLM-generated predictions. The 12 models are evaluated using 9 novel taxonomy-related text-to-image metrics and human feedback. Moreover, we pioneer the use of pairwise evaluation with GPT-4 feedback for image generation. Experimental results show that the ranking of models differs significantly from standard T2I tasks. Playground-v2 and FLUX consistently outperform the other models across metrics and subsets, and the retrieval-based approach performs poorly. These findings highlight the potential for automating the curation of structured data resources.
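
The pairwise human and GPT-4 preferences mentioned above are turned into Elo-style rankings. The following is a minimal sketch of a standard sequential Elo update, not the paper's exact procedure: the function name `elo_ratings`, the K-factor of 32, the starting rating of 1000, and the model names in the usage note are all assumptions made for illustration.

```python
from collections import defaultdict


def elo_ratings(matches, k=32.0, base=1000.0):
    """Compute Elo ratings from pairwise preference judgments.

    matches: iterable of (model_a, model_b, outcome) where outcome is
             1.0 if model_a's image was preferred, 0.0 if model_b's was,
             and 0.5 for a tie.
    k, base: hypothetical K-factor and starting rating (assumptions).
    """
    ratings = defaultdict(lambda: base)
    for a, b, outcome in matches:
        ra, rb = ratings[a], ratings[b]
        # Expected score of model_a under the logistic Elo model.
        expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        # Zero-sum update: what model_a gains, model_b loses.
        delta = k * (outcome - expected_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return dict(ratings)
```

For example, `elo_ratings([("model_x", "model_y", 1.0)])` moves the hypothetical `model_x` above `model_y` by the same amount that `model_y` drops. Because sequential Elo depends on match order, averaging over several shuffles of the judgments is a common way to stabilise the ranking.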

Paper Structure

This paper contains 51 sections, 4 theorems, 21 equations, 15 figures, 16 tables.

Key Result

Theorem 1

Let $V$ be a finite set of concepts, and suppose the prior distribution is uniform: $P(V=v)=\frac{1}{|V|}$ for all $v \in V$. Then, for all $x \in X$, $\arg\max_{v \in V} S_{\text{lemma}}(v, x) = \arg\max_{v \in V} P(V=v \mid X=x)$.
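
The role of the uniform prior can be sketched with Bayes' rule. The definition of Lemma Similarity is not reproduced in this summary, so the last step assumes that $S_{\text{lemma}}(v, x)$ is a strictly increasing function of the likelihood $P(X=x \mid V=v)$; this assumption is ours, and the sketch is not the paper's proof.

$$
P(V=v \mid X=x) = \frac{P(X=x \mid V=v)\, P(V=v)}{P(X=x)} = \frac{P(X=x \mid V=v)}{|V|\, P(X=x)}
$$

The denominator $|V|\, P(X=x)$ does not depend on $v$, so

$$
\arg\max_{v \in V} P(V=v \mid X=x) = \arg\max_{v \in V} P(X=x \mid V=v),
$$

and under the monotonicity assumption the right-hand side has the same maximiser as $S_{\text{lemma}}(v, x)$.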

Figures (15)

  • Figure 1: Comparison of generations of the Playground model for an input prompt from the DiffusionDB dataset and for the available inputs from WordNet-3.0. The input from the TTI dataset is more detailed, and the model's inner representation can be misleading even when the definition is given.
  • Figure 2: An example of generation and retrieval results for cigar lighter. The generation approach is clearly superior to the retrieval approach here, as the retrieved image is quite unconventional.
  • Figure 3: LLM prompt example for evaluating text-to-image assistants.
  • Figure 4: Elo scores for human and GPT-4 preferences. The prompt includes the definition. The overall Spearman correlation of model rankings is high at $0.92$, with $p$-value $\leq 0.05$.
  • Figure 5: Distribution of preferences for humans and GPT-4 across subsets, in percent. The prompt includes the definition.
  • ...and 10 more figures
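
The Figure 4 caption compares human and GPT-4 model rankings using Spearman correlation. As a rough illustration of that comparison, the sketch below ranks models by score and applies the classic no-ties Spearman formula. The helper names and the model names in the usage note are hypothetical, and the code assumes that both judges score the same set of models with no tied scores.

```python
def ranks_from_scores(scores):
    """Map each model to its rank (1 = highest score). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def spearman_from_ranks(rank_a, rank_b):
    """Spearman rank correlation for two rankings of the same models.

    Uses rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), which is valid
    only when there are no ties and n >= 2.
    """
    if set(rank_a) != set(rank_b):
        raise ValueError("both rankings must cover the same models")
    n = len(rank_a)
    if n < 2:
        raise ValueError("need at least two models")
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))
```

Given two hypothetical score dictionaries `human` and `gpt` keyed by model name, `spearman_from_ranks(ranks_from_scores(human), ranks_from_scores(gpt))` returns a value in $[-1, 1]$, where 1 means the two judges order the models identically.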

Theorems & Definitions (12)

  • Definition: Lemma Similarity
  • Theorem 1
  • Proof
  • Definition: Hypernym Similarity
  • Definition: Cohyponym Similarity
  • Theorem 2
  • Proof
  • Definition: Specificity
  • Theorem 3
  • Proof
  • ...and 2 more