Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training

Ziqi Gao; Weikai Huang; Jieyu Zhang; Aniruddha Kembhavi; Ranjay Krishna

TL;DR

This work tackles the lack of compositional grounding in text-to-vision systems by introducing Generate Any Scene (GAS), a scene-graph-driven data engine that algorithmically enumerates a vast space of scene graphs from a rich visual taxonomy and converts them into captions and QA pairs. GAS enables scalable self-improvement, targeted distillation from proprietary models, and low-cost semantic reward modeling via exhaustive scene-graph queries, with gains demonstrated across text-to-image, video, and 3D tasks. Key contributions include a self-improving framework that improves Stable Diffusion v1.5 by about 4% on average, a distillation pipeline that transfers DALL-E 3 strengths to open models with a roughly 10% gain in TIFA score, and a GRPO-based reward mechanism that outperforms CLIP-based approaches on several benchmarks. The approach also strengthens content moderation by augmenting datasets with diverse, compositional synthetic captions, showing that synthetic data offers a scalable path to improved alignment and evaluation in text-to-vision systems.

Abstract

Recent advances in text-to-vision generation excel in visual fidelity but struggle with compositional generalization and semantic alignment. Existing datasets are noisy and weakly compositional, limiting models' understanding of complex scenes, while scalable solutions for dense, high-quality annotations remain a challenge. We introduce Generate Any Scene, a data engine that systematically enumerates scene graphs representing the combinatorial array of possible visual scenes. Generate Any Scene dynamically constructs scene graphs of varying complexity from a structured taxonomy of objects, attributes, and relations. Given a sampled scene graph, Generate Any Scene translates it into a caption for text-to-image or text-to-video generation; it also translates it into a set of visual question-answer pairs that allow automatic evaluation and reward modeling of semantic alignment. Using Generate Any Scene, we first design a self-improving framework in which models iteratively enhance their performance using generated data. With this framework, Stable Diffusion v1.5 achieves an average 4% improvement over baselines, surpassing fine-tuning on CC3M. Second, we design a distillation algorithm to transfer specific strengths from proprietary models to their open-source counterparts. Using fewer than 800 synthetic captions, we fine-tune Stable Diffusion v1.5 and achieve a 10% increase in TIFA score on compositional and hard concept generation. Third, we create a reward model to align model generation with semantic accuracy at a low cost. Using the GRPO algorithm, we fine-tune SimpleAR-0.5B-SFT and surpass CLIP-based methods by 5% on DPG-Bench. Finally, we apply these ideas to the downstream task of content moderation, where we train models to identify challenging cases by learning from synthetic data.
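
The pipeline described in the abstract (sample a scene graph from a taxonomy, render it as a caption, and derive question-answer pairs for scoring) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the three-entry taxonomy, the `SceneGraph` class, the caption and question templates, and the `semantic_reward` helper (the fraction of questions answered as expected) are hypothetical stand-ins, not the paper's actual taxonomy, templates, or reward definition.

```python
import random
from dataclasses import dataclass

# Toy taxonomy: illustrative placeholders, not the paper's taxonomy.
OBJECTS = ["dog", "ball", "table"]
ATTRIBUTES = ["red", "small", "wooden"]
RELATIONS = ["next to", "on top of", "behind"]


@dataclass
class SceneGraph:
    objects: list      # object names, indexed by node id
    attributes: dict   # node id -> attribute string
    relations: list    # (subject id, relation, object id) triples


def sample_scene_graph(num_objects, num_relations, rng):
    """Sample distinct objects, one attribute each, and directed relations."""
    objects = rng.sample(OBJECTS, num_objects)
    attributes = {i: rng.choice(ATTRIBUTES) for i in range(num_objects)}
    pairs = [(i, j) for i in range(num_objects)
             for j in range(num_objects) if i != j]
    chosen = rng.sample(pairs, min(num_relations, len(pairs)))
    relations = [(s, rng.choice(RELATIONS), o) for s, o in chosen]
    return SceneGraph(objects, attributes, relations)


def phrase(graph, i):
    """Attribute plus object name for node i, e.g. 'red ball'."""
    return f"{graph.attributes[i]} {graph.objects[i]}"


def to_caption(graph):
    """Render the graph as a caption using a simple hypothetical template."""
    parts = [f"a {phrase(graph, s)} {rel} a {phrase(graph, o)}"
             for s, rel, o in graph.relations]
    mentioned = {i for s, _, o in graph.relations for i in (s, o)}
    # Objects that take part in no relation are still listed.
    parts += [f"a {phrase(graph, i)}"
              for i in range(len(graph.objects)) if i not in mentioned]
    return ", ".join(parts) + "."


def to_qa_pairs(graph):
    """One yes/no question per object, attribute, and relation in the graph."""
    qa = []
    for i, name in enumerate(graph.objects):
        qa.append((f"Is there a {name}?", "yes"))
        qa.append((f"Is the {name} {graph.attributes[i]}?", "yes"))
    for s, rel, o in graph.relations:
        qa.append((f"Is the {graph.objects[s]} {rel} the {graph.objects[o]}?", "yes"))
    return qa


def semantic_reward(qa_pairs, answer_fn):
    """Assumed reward: fraction of questions whose answer matches the graph.

    answer_fn stands in for a VQA model queried on a generated image.
    """
    correct = sum(answer_fn(question) == answer for question, answer in qa_pairs)
    return correct / len(qa_pairs)


if __name__ == "__main__":
    graph = sample_scene_graph(num_objects=3, num_relations=2, rng=random.Random(0))
    print(to_caption(graph))
    for question, answer in to_qa_pairs(graph):
        print(question, "->", answer)
```

Because the questions are derived from the same graph as the caption, any generated image can in principle be scored without human annotation by passing a question-answering model as `answer_fn`. The real system would need a far larger taxonomy and more careful templating than this sketch.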
