StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Rushikesh Zawar, Shaurya Dewan, Andrew F. Luo, Margaret M. Henderson, Michael J. Tarr, Leila Wehbe
TL;DR
StableSemantics addresses the challenge of grounding semantic concepts in visually diverse real-world scenes by building a large synthetic language-vision dataset that couples human-curated prompts, LLM-generated natural language captions, multiple diffusion-generated images, and dense diffusion cross-attention maps tied to noun chunks. The authors introduce a fully documented pipeline that records seeds, derives noun-chunk semantic maps via DAAM-i2i, and analyzes the spatial distribution of concepts to enable open-vocabulary segmentation and captioning evaluation. This dataset is the first to systematically attach semantic attributions to diffusion-based generation, offering a foundation for analyzing grounding, distributional biases, and model interpretability in text-to-image synthesis. By making the data openly available under CC0, StableSemantics is positioned to accelerate advances in visual semantic understanding, benchmarking, and robust language-vision grounding across diverse naturalistic images.
Abstract
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting stable diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics
