StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images

Rushikesh Zawar, Shaurya Dewan, Andrew F. Luo, Margaret M. Henderson, Michael J. Tarr, Leila Wehbe

TL;DR

StableSemantics addresses the challenge of grounding semantic concepts in visually diverse real-world scenes by building a large synthetic language-vision dataset that couples human-curated prompts, LLM-generated natural language captions, multiple diffusion-generated images, and dense diffusion cross-attention maps tied to noun chunks. The authors introduce a fully documented pipeline that records seeds, derives noun-chunk semantic maps via DAAM-i2i, and analyzes the spatial distribution of concepts to enable open-vocabulary segmentation and captioning evaluation. This dataset is the first to systematically attach semantic attributions to diffusion-based generation, offering a foundation for analyzing grounding, distributional biases, and model interpretability in text-to-image synthesis. By making the data openly available under CC0, StableSemantics is positioned to accelerate advances in visual semantic understanding, benchmarking, and robust language-vision grounding across diverse naturalistic images.

Abstract

Understanding the semantics of visual scenes is a fundamental challenge in computer vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting Stable Diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open-vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics
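
Benchmarking open-vocabulary segmentation, as mentioned in the abstract, involves comparing a predicted region against the map recorded for a noun chunk. As a minimal sketch, assuming a per-chunk attention map is available as a 2D grid of floats, one could binarize the map and score the overlap with intersection-over-union. The min-max normalization, the 0.5 threshold, and all function names here are illustrative assumptions, not the authors' evaluation protocol.

```python
from typing import List

Map = List[List[float]]   # attention values for one noun chunk (assumed layout)
Mask = List[List[bool]]   # binary segmentation mask of the same shape


def normalize(attn: Map) -> Map:
    """Min-max scale a 2D attention map into [0, 1]."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        # A constant map carries no localization signal.
        return [[0.0 for _ in row] for row in attn]
    return [[(v - lo) / span for v in row] for row in attn]


def to_mask(attn: Map, threshold: float = 0.5) -> Mask:
    """Binarize a normalized map; the threshold is an assumed hyperparameter."""
    return [[v >= threshold for v in row] for row in normalize(attn)]


def iou(pred: Mask, ref: Mask) -> float:
    """Intersection-over-union between two masks of equal shape."""
    inter = union = 0
    for pred_row, ref_row in zip(pred, ref):
        for p, r in zip(pred_row, ref_row):
            inter += int(p and r)
            union += int(p or r)
    # Two empty masks are treated as a perfect match.
    return inter / union if union else 1.0


if __name__ == "__main__":
    attn = [[0.1, 0.2, 0.9],
            [0.0, 0.8, 1.0]]
    reference = [[False, False, True],
                 [False, False, True]]
    print(iou(to_mask(attn), reference))
```

A fixed threshold is the simplest choice. In practice it would likely need tuning per model or per map resolution.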

Paper Structure (24 sections, 13 figures, 3 tables)

This paper contains 24 sections, 13 figures, 3 tables.

Figures (13)

  • Figure 1: Images and maps corresponding to select noun chunks from StableSemantics. Images are generated using natural language captions derived from human-generated and curated prompts. For reproducibility, seeds are recorded for each generation. Noun chunks are extracted by performing dependency parsing on the natural language captions. Semantic maps corresponding to each noun chunk are computed from the cross-attention maps using the DAAM method (Tang et al., 2022). Only a single attention map is shown here for each image; please see below for additional examples. Yellow indicates high relevance, black indicates low relevance.
  • Figure 2: Data collection and generation process. (1) We collect our data from the Stable Diffusion Discord, specifically the showdown and pantheon channels, which are derived from user rankings of images generated from public prompt submissions. (2) The prompts are cleaned using regex to remove common errors, and further processed using an LLM to generate natural language captions. (3) The natural language captions are provided to a Stable Diffusion XL model, while we record the attention attribution maps corresponding to noun chunks.
  • Figure 3: Histogram of dataset statistics. (a) We visualize the CLIP cosine similarities between generated images and original captions. (b) Number of tokens in the captions. (c) NSFW scores of the captions after LLM filtering. Scores measured by LLaMA Guard 2 for sexuality and hate.
  • Figure 4: Example of SDXL-generated images from the captions, raw user prompts, and LLM-processed captions. Raw prompts from users often contain typos or take a tag-like, non-natural-language form. We instruct an LLM to transform the prompts into a natural language caption. Noun chunks (bolded and underlined) are derived from dependency parsing. Images are generated from the captions, with diffusion attribution maps recorded for the noun chunks.
  • Figure 5: Visualization of the dataset. We show example captions used for image generation, images generated from the captions, and select noun chunks and their corresponding attention attribution maps. We find that our dataset contains accurate localizations for different semantic concepts.
  • ...and 8 more figures
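
Two steps named in the captions above lend themselves to a short illustration: the regex cleaning of raw prompts (Figure 2) and the cosine similarity used to compare image and caption embeddings (Figure 3a). The sketch below is hedged. The cleaning rules are hypothetical examples of "common errors" (stray whitespace and repeated commas), since the actual patterns are not listed here. The similarity function works on plain float vectors standing in for CLIP embeddings, which this snippet does not compute.

```python
import math
import re
from typing import Sequence


def clean_prompt(raw: str) -> str:
    """Tidy a tag-like prompt. These rules are illustrative assumptions."""
    text = re.sub(r"\s+", " ", raw)            # collapse whitespace runs
    text = re.sub(r"\s*,(\s*,)+", ",", text)   # collapse repeated commas
    text = re.sub(r"\s+,", ",", text)          # drop space before a comma
    text = re.sub(r",(?=\S)", ", ", text)      # ensure a space after a comma
    return text.strip(" ,")                    # trim leading/trailing separators


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    print(clean_prompt("  a castle ,, sunset,,  4k ,trending "))
    # Stand-in vectors; real use would pass image and text embeddings.
    print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))
```

Because cosine similarity ignores vector magnitude, the embeddings do not need to be normalized beforehand.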