Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

Ziqi Gao, Weikai Huang, Jieyu Zhang, Aniruddha Kembhavi, Ranjay Krishna

TL;DR

This work tackles the lack of compositional grounding in text-to-vision systems by introducing Generate Any Scene (GAS), a scene-graph-driven data engine that algorithmically enumerates vast numbers of scene graphs from a rich visual taxonomy and converts them into captions and QA pairs. GAS enables scalable self-improvement, targeted distillation from proprietary models, and low-cost semantic reward modeling via exhaustive scene-graph queries, demonstrating gains across text-to-image, video, and 3D tasks. Key contributions include a self-improving framework that improves SDv1.5 by ~4% on average, a distillation pipeline that transfers DALL-E 3 strengths to open models with ~10% TIFA gains, and a GRPO-based reward mechanism that outperforms CLIP-based approaches on several benchmarks. The approach also strengthens content moderation by augmenting datasets with diverse, compositional synthetic captions, showcasing synthetic data as a scalable path to improved alignment and evaluation in text-to-vision systems.

Abstract

Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question-answer pairs that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework in which models iteratively enhance their performance using generated data. Stable Diffusion v1.5 achieves an average 4% improvement over baselines and surpasses fine-tuning on CC3M. Second, we design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by 5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation, where we train models to identify challenging cases by learning from synthetic data.
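
The abstract describes using scene-graph questions as a reward signal for GRPO fine-tuning but does not spell out the formula. The sketch below is one plausible reading, not the authors' implementation: the reward is assumed to be the fraction of generated questions whose answer matches the expected one, and the advantage is the usual GRPO group normalization (reward minus group mean, divided by group standard deviation). The function names, the `eps` constant, and the choice of population standard deviation are all illustrative assumptions.

```python
from statistics import mean, pstdev


def qa_reward(answers, expected):
    """Assumed reward: fraction of scene-graph questions answered as expected.

    `expected` maps each question to its ground-truth answer; `answers` maps
    questions to whatever a VQA model said about the generated image.
    """
    if not expected:
        return 0.0
    correct = sum(1 for q, a in expected.items() if answers.get(q) == a)
    return correct / len(expected)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize rewards within one group of samples.

    All rewards in `rewards` are assumed to come from generations for the
    same caption. `eps` (an assumption) avoids division by zero when every
    sample in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

For example, three generations for one caption that score 1.0, 0.5, and 0.0 receive a positive, a zero, and a negative advantage respectively, so the policy update favors the image that satisfies the most scene-graph questions.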

Paper Structure

This paper contains 76 sections, 17 figures, 17 tables.

Figures (17)

  • Figure 1: The generation pipeline of Generate Any Scene. Step 1: Enumerate diverse scene graph structures under user-defined constraints. Step 2: Populate structures with sampled objects, attributes, and relations. Step 3: Sample scene attributes such as style, perspective, or time span. Step 4: Translate scene graph and attributes into coherent captions. Step 5: Automatically generate QA pairs covering all elements for evaluation and reward modeling.
  • Figure 2: Results for Self-Improving Models. Average VQA score of SDv1.5 fine-tuned on different data, measured on the 1K Generate Any Scene image/video evaluation set and the GenAI-Bench image/video benchmark [li2024genai].
  • Figure 3: Examples for Distilling Capabilities. Examples of images generated by DALL-E 3, the original SDv1.5, and the fine-tuned versions. The left four captions demonstrate fine-tuning with multi-object captions generated by Generate Any Scene for better compositionality, while the right two columns focus on understanding hard concepts.
  • Figure 4: Results for Distilling Capabilities. The left two figures show the results for distilling compositionality, while the rightmost figure shows the results for distilling hard-concept understanding from DALL-E 3.
  • Figure 5: Comparison of generated images. Our reward model enables image generation with better semantic alignment, realism, and visual quality than baselines.
  • ...and 12 more figures
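
The five steps in the Figure 1 caption can be illustrated with a toy sketch. This is not the Generate Any Scene code: the taxonomy lists, function names, and caption template are hypothetical, and the structure is sampled at random here rather than systematically enumerated as the paper describes.

```python
import random

# Hypothetical toy taxonomy; the real engine draws on a much richer one.
OBJECTS = ["dog", "ball", "bench"]
ATTRIBUTES = ["red", "small", "wooden"]
RELATIONS = ["next to", "on top of", "behind"]
STYLES = ["watercolor", "photograph"]


def build_scene_graph(num_objects, num_relations, rng):
    """Steps 1-3: choose a structure, populate it, and sample a scene attribute."""
    # Step 1 (structure): pick which ordered node pairs get a relation edge.
    pairs = [(i, j) for i in range(num_objects) for j in range(num_objects) if i != j]
    chosen = rng.sample(pairs, min(num_relations, len(pairs)))
    # Step 2 (population): fill nodes and edges from the taxonomy.
    nodes = [{"name": rng.choice(OBJECTS), "attrs": [rng.choice(ATTRIBUTES)]}
             for _ in range(num_objects)]
    edges = [(i, rng.choice(RELATIONS), j) for i, j in chosen]
    # Step 3 (scene attributes): only a style is sampled in this sketch.
    return {"nodes": nodes, "edges": edges, "style": rng.choice(STYLES)}


def phrase(node):
    return " ".join(node["attrs"] + [node["name"]])


def to_caption(graph):
    """Step 4: render the graph with a fixed template."""
    nodes, edges = graph["nodes"], graph["edges"]
    parts = [f"a {phrase(nodes[i])} {rel} a {phrase(nodes[j])}" for i, rel, j in edges]
    mentioned = {k for i, _, j in edges for k in (i, j)}
    parts += [f"a {phrase(n)}" for idx, n in enumerate(nodes) if idx not in mentioned]
    return f"A {graph['style']} of " + "; ".join(parts) + "."


def to_qa(graph):
    """Step 5: one yes/no question per object, attribute, and relation."""
    nodes = graph["nodes"]
    qa = [(f"Is there a {n['name']}?", "yes") for n in nodes]
    qa += [(f"Is the {n['name']} {a}?", "yes") for n in nodes for a in n["attrs"]]
    qa += [(f"Is the {nodes[i]['name']} {rel} the {nodes[j]['name']}?", "yes")
           for i, rel, j in graph["edges"]]
    return qa
```

Calling `build_scene_graph(3, 2, random.Random(0))` and passing the result to `to_caption` and `to_qa` yields one caption and eight question-answer pairs (three objects, three attributes, two relations), which shows how every element of the graph becomes checkable.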