What Makes a Scene ? Scene Graph-based Evaluation and Feedback for Controllable Generation

Zuyao Chen, Jinlin Wu, Zhen Lei, Chang Wen Chen

TL;DR

This work targets the gap in evaluating factual consistency for scene-graph-to-image generation. It introduces Scene-Bench, combining MegaSG, a large-scale scene-graph dataset, and SGScore, a multimodal-LLM–based metric that measures ObjectRecall and RelationRecall to assess scene fidelity. A scene-graph feedback pipeline further refines generated images by identifying and correcting discrepancies between the scene graph and the image, achieving significant improvements in object and relationship recall, particularly on complex scenes. The results demonstrate that SGScore provides a more sensitive fidelity measure than traditional metrics and that the feedback loop yields practical gains in controllable image generation.

Abstract

While text-to-image generation has been extensively studied, generating images from scene graphs remains relatively underexplored, primarily due to challenges in accurately modeling spatial relationships and object interactions. To fill this gap, we introduce Scene-Bench, a comprehensive benchmark designed to evaluate and enhance the factual consistency in generating natural scenes. Scene-Bench comprises MegaSG, a large-scale dataset of one million images annotated with scene graphs, facilitating the training and fair comparison of models across diverse and complex scenes. Additionally, we propose SGScore, a novel evaluation metric that leverages chain-of-thought reasoning capabilities of multimodal large language models (LLMs) to assess both object presence and relationship accuracy, offering a more effective measure of factual consistency than traditional metrics like FID and CLIPScore. Building upon this evaluation framework, we develop a scene graph feedback pipeline that iteratively refines generated images by identifying and correcting discrepancies between the scene graph and the image. Extensive experiments demonstrate that Scene-Bench provides a more comprehensive and effective evaluation framework compared to existing benchmarks, particularly for complex scene generation. Furthermore, our feedback strategy significantly enhances the factual consistency of image generation models, advancing the field of controllable image generation.
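
The two quantities behind SGScore, and the feedback loop built on them, can be illustrated with a minimal sketch. Everything named here is hypothetical: `SceneGraph`, `score_scene`, and `refine_with_feedback` are illustrative names, the multimodal LLM is stood in for by plain yes/no callables, and `refine` is a placeholder for whatever editing step a real pipeline would use. How the paper aggregates the two recalls into a single SGScore is not stated in this summary, so the sketch reports them separately. Scoring an empty list as 1.0 is likewise an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


@dataclass(frozen=True)
class SceneGraph:
    objects: List[str]
    relations: List[Triple]


@dataclass(frozen=True)
class Report:
    object_recall: float
    relation_recall: float
    missing_objects: List[str]
    missing_relations: List[Triple]


def _recall(total: int, missing: int) -> float:
    # Assumption: an empty list of targets counts as fully satisfied.
    return 1.0 if total == 0 else (total - missing) / total


def score_scene(
    graph: SceneGraph,
    has_object: Callable[[str], bool],
    has_relation: Callable[[Triple], bool],
) -> Report:
    """Fraction of scene-graph objects and relations judged present.

    `has_object` and `has_relation` stand in for yes/no questions that a
    multimodal LLM would answer about a generated image.
    """
    missing_objects = [o for o in graph.objects if not has_object(o)]
    missing_relations = [r for r in graph.relations if not has_relation(r)]
    return Report(
        object_recall=_recall(len(graph.objects), len(missing_objects)),
        relation_recall=_recall(len(graph.relations), len(missing_relations)),
        missing_objects=missing_objects,
        missing_relations=missing_relations,
    )


def refine_with_feedback(
    graph: SceneGraph,
    image: Any,
    judge: Callable[[Any], Tuple[Callable[[str], bool], Callable[[Triple], bool]]],
    refine: Callable[[Any, Report], Any],
    max_rounds: int = 3,
) -> Tuple[Any, Report]:
    """Re-score and refine until nothing is missing or the round budget ends.

    `judge(image)` returns the two yes/no callables for that image;
    `refine(image, report)` is a hypothetical correction step.
    """
    report = score_scene(graph, *judge(image))
    for _ in range(max_rounds):
        if not report.missing_objects and not report.missing_relations:
            break
        image = refine(image, report)
        report = score_scene(graph, *judge(image))
    return image, report
```

In this sketch the discrepancy report (`missing_objects`, `missing_relations`) is what drives each refinement round, mirroring the abstract's description of identifying and then correcting mismatches between the scene graph and the image.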

Paper Structure

This paper contains 28 sections, 6 equations, 13 figures, 10 tables, 1 algorithm.

Figures (13)

  • Figure 1: A comparison of CLIPScore (Hessel et al., 2021) and the proposed SGScore for evaluating factual consistency. SGScore can distinguish relationship discrepancies, while CLIPScore often overlooks them.
  • Figure 2: Overview of Scene-Bench. (a) Scene graphs are generated from images using a multimodal LLM (M-LLM), capturing object relationships and interactions. Scene Diversity and Scene Complexity guide sampling to ensure dataset balance. (b) Scene distribution across categories highlights the diversity in People-Centric and Non-People-Centric themes. (c) Scene graph-based evaluation and feedback leverages the M-LLM to calculate object and relationship recall, generating an SGScore metric that quantifies factual consistency between the generated image and the intended scene. The feedback identifies and corrects discrepancies, iteratively refining the generated image.
  • Figure 3: Comparison of model performances using SGScore across various scene categories.
  • Figure 4: Comparison of Scene Graph-based Image Generation across Different Models. Each row displays a unique scene graph used as input for image generation. We present the SGScore below each generated image to quantify the consistency between the scene graph and the generated output.
  • Figure 5: Illustration of scene categories in the MegaSG dataset. The image shows various themes, such as People-Centric (e.g., social interaction, individual activities) and Non-People-Centric (e.g., nature, urban environments). The caption is provided for illustrative purposes and was generated using BLIP-2 (Li et al., 2023); the scene graph is constructed as described in Sections \ref{sec:megasg} and \ref{sec:supp_mega}.
  • ...and 8 more figures