ConceptMix: A Compositional Image Generation Benchmark with Controllable Difficulty

Xindi Wu; Dingli Yu; Yangsibo Huang; Olga Russakovsky; Sanjeev Arora

TL;DR

ConceptMix tackles the challenge of evaluating compositionality in Text-to-Image (T2I) models by replacing fixed prompts with a GPT-4o-driven two-stage benchmark. It generates prompts by combining one object with up to $k$ concepts across eight visual-concept categories and automatically grades the resulting images via concept-level Yes/No questions answered by GPT-4o, enabling per-concept scoring. The approach demonstrates that larger $k$ increases difficulty, revealing clear discrimination among models (with DALL·E 3 leading) and exposing limitations tied to training data diversity, particularly in LAION. The benchmark scales to millions of prompts, provides automated, interpretable grading, and offers guidance for data collection and model development to improve compositional generation in T2I systems. Overall, ConceptMix advances evaluation methodology, encouraging more nuanced progress toward truly compositional image generation.

Abstract

Compositionality is a critical capability in Text-to-Image (T2I) models, as it reflects their ability to understand and combine multiple concepts from text descriptions. Existing evaluations of compositional capability rely heavily on human-designed text prompts or fixed templates, limiting their diversity and complexity, and yielding low discriminative power. We propose ConceptMix, a scalable, controllable, and customizable benchmark which automatically evaluates the compositional generation ability of T2I models. This is done in two stages. First, ConceptMix generates the text prompts: concretely, using categories of visual concepts (e.g., objects, colors, shapes, spatial relationships), it randomly samples an object and $k$-tuples of visual concepts, then uses GPT-4o to generate text prompts for image generation based on these sampled concepts. Second, ConceptMix evaluates the images generated in response to these prompts: concretely, it checks how many of the $k$ concepts actually appeared in the image by generating one question per visual concept and using a strong VLM to answer them. Through administering ConceptMix to a diverse set of T2I models (proprietary as well as open ones) using increasing values of $k$, we show that ConceptMix has higher discrimination power than earlier benchmarks. Specifically, ConceptMix reveals that the performance of several models, especially open models, drops dramatically with increased $k$. Importantly, it also provides insight into the lack of prompt diversity in widely-used training datasets. Additionally, we conduct extensive human studies to validate the design of ConceptMix and compare our automatic grading with human judgement. We hope it will guide future T2I model development.
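The sampling step in the first stage can be illustrated with a minimal sketch. It assumes each of the $k$ additional concepts comes from a distinct category, which the abstract does not state explicitly. The category names follow the abstract and figure captions, but the concept values, the object list, and the names `CONCEPTS`, `OBJECTS`, and `sample_concepts` are hypothetical placeholders, not the benchmark's actual vocabulary or code. The GPT-4o call that turns the sampled concepts into a prompt is omitted.

```python
import random

# Hypothetical vocabulary: category names follow the abstract and figure
# captions; the values are placeholders, not the benchmark's real lists.
CONCEPTS = {
    "color": ["red", "blue", "green"],
    "number": ["two", "three", "four"],
    "shape": ["round", "square", "triangular"],
    "size": ["tiny", "huge"],
    "texture": ["fluffy", "metallic", "wooden"],
    "style": ["watercolor", "pixel art"],
    "spatial": ["left of", "on top of", "behind"],
}
OBJECTS = ["cat", "teapot", "bicycle"]


def sample_concepts(k, rng):
    """Sample one object plus one value from each of k distinct categories.

    Assumption: categories are not repeated within a single prompt.
    """
    if not 0 <= k <= len(CONCEPTS):
        raise ValueError(f"k must be between 0 and {len(CONCEPTS)}")
    obj = rng.choice(OBJECTS)
    # sorted() gives a stable category order so a seeded rng is reproducible.
    categories = rng.sample(sorted(CONCEPTS), k)
    concepts = {c: rng.choice(CONCEPTS[c]) for c in categories}
    return {"object": obj, "concepts": concepts}


if __name__ == "__main__":
    rng = random.Random(0)
    for k in range(4):
        print(k, sample_concepts(k, rng))
```

Raising `k` in this sketch increases the number of constraints a single image must satisfy at once, which is the difficulty knob the abstract describes.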

Paper Structure (46 sections, 15 figures, 10 tables)

This paper contains 46 sections, 15 figures, 10 tables.

Introduction
ConceptMix
Overview
Selecting Visual Concepts
Compositional Prompt Generation
Concept Evaluation
Human Evaluation
Experiments
Experimental Setup
Performance on Individual Concept Categories ($k=1$)
Performance of Compositional Generation ($k>1$)
ConceptMix has stronger discriminative power than other evaluation pipelines
Tracing the poor performance of models back to lack of diversity in training data
Discussion
Conclusion
...and 31 more sections

Figures (15)

Figure 1: Overview of ConceptMix benchmark for T2I models. Here we show some prompts generated using a different number of visual concepts. Each prompt uses a default object and a random selection of additional visual concepts from $k$ categories ($k=1,\dots,7$; $k=0$ means one object, $k=1$ means an object with one additional concept, etc.). We show images generated by DALL·E 3 [betker2023improving] for these prompts. Note that the images are not part of the ConceptMix benchmark; the benchmark is a distribution of visual prompts and corresponding evaluation questions. ConceptMix provides a scalable, controllable and customizable benchmark for compositional T2I evaluation.
Figure 2: ConceptMix. ConceptMix consists of two main stages: 1) Compositional Prompt Generation: We randomly select visual concepts from 8 categories and combine them to form generation statements and intermediate JSON files with GPT-4o assistance. The statements and JSON structure are then used by GPT-4o to generate a text prompt, which, if valid, is fed into a T2I model to produce an image. 2) Concept Evaluation: The generated image is graded based on how well it matches with each visual concept. This is done by converting the generation statements into questions and evaluating the answers. The image receives a score of 1 if it correctly matches all concepts, and 0 if any concept is not satisfied.
Figure 3: Our Scores vs. Human Scores on ConceptMix with (a) different $k$ values for the DALL·E 3 model, and (b) $k=3$ for different models.
Figure 4: T2VScore [lin2024evaluating] vs. Human Scores on ConceptMix with (a) different $k$ values for the DALL·E 3 model, and (b) $k=3$ for different models.
Figure 5: Performance Across Concept Categories. We evaluate the performance of T2I models across different concept categories. Color and style are easier, with all models achieving high scores. Performance is lower for generating specific numbers of objects and spatial relationships, with varying results for texture and size. Overall, DALL·E 3 outperforms others in all categories.
...and 10 more figures
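
The all-or-nothing grading rule in the Figure 2 caption (score 1 if every concept is satisfied, 0 otherwise) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function names `image_score` and `benchmark_score` are hypothetical, and the Yes/No answers are assumed to have already been collected from the grading model as booleans.

```python
def image_score(answers):
    """Return 1 if every concept question was answered 'yes', else 0.

    `answers` maps a concept label to a boolean (True means 'yes').
    """
    return int(all(answers.values()))


def benchmark_score(per_image_answers):
    """Average the all-or-nothing image scores over a set of images."""
    if not per_image_answers:
        raise ValueError("need at least one graded image")
    scores = [image_score(a) for a in per_image_answers]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    graded = [
        {"object": True, "color": True},   # all concepts present, score 1
        {"object": True, "color": False},  # one concept missing, score 0
    ]
    print(benchmark_score(graded))  # prints 0.5
```

Because a single missed concept zeroes the image score, the chance of a full score shrinks as more concepts are added, consistent with the drop in performance at larger $k$ reported in the abstract.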
