SceneEval: Evaluating Semantic Coherence in Text-Conditioned 3D Indoor Scene Synthesis

Hou In Ivan Tam, Hou In Derek Pun, Austin T. Wang, Angel X. Chang, Manolis Savva

TL;DR

This work evaluates six recent scene generation approaches using SceneEval and demonstrates its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions.

Abstract

Despite recent advances in text-conditioned 3D indoor scene generation, there remain gaps in the evaluation of these methods. Existing metrics often measure realism by comparing generated scenes to a set of ground-truth scenes, but they overlook how well scenes follow the input text and capture implicit expectations of plausibility. We present SceneEval, an evaluation framework designed to address these limitations. SceneEval introduces fine-grained metrics for explicit user requirements (including object counts, attributes, and spatial relationships) and complementary metrics for implicit expectations such as support, collisions, and navigability. Together, these provide interpretable and comprehensive assessments of scene quality. To ground evaluation, we curate SceneEval-500, a benchmark of 500 text descriptions with detailed annotations of expected scene properties. This dataset establishes a common reference for reproducible and systematic comparison across scene generation methods. We evaluate six recent scene generation approaches using SceneEval and demonstrate its ability to provide detailed assessments of the generated scenes, highlighting strengths and areas for improvement across multiple dimensions. Our results identify significant gaps in current methods, underscoring the need for further research toward practical and controllable scene synthesis.
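To make the distinction between explicit and implicit checks concrete, here is a minimal sketch of one check of each kind: an object-count test and an axis-aligned bounding-box collision test. It is an illustration under stated assumptions, not SceneEval's implementation; the `Box` layout, the function names, and the exact-match and box-overlap criteria are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Box:
    """Hypothetical scene object: a category plus an axis-aligned bounding box."""
    category: str
    min_xyz: Vec3
    max_xyz: Vec3


def count_satisfied(objects: List[Box], category: str, expected: int) -> bool:
    """Explicit check (assumed criterion): exactly `expected` objects of `category`."""
    return sum(1 for o in objects if o.category == category) == expected


def boxes_overlap(a: Box, b: Box, eps: float = 1e-6) -> bool:
    """True if the boxes interpenetrate by more than eps on all three axes.

    Boxes that merely touch (e.g. a lamp resting on a nightstand) do not count.
    """
    return all(
        min(a.max_xyz[i], b.max_xyz[i]) - max(a.min_xyz[i], b.min_xyz[i]) > eps
        for i in range(3)
    )


def colliding_pairs(objects: List[Box]) -> List[Tuple[str, str]]:
    """Implicit check (assumed criterion): list every pair of overlapping boxes."""
    return [
        (a.category, b.category)
        for a, b in combinations(objects, 2)
        if boxes_overlap(a, b)
    ]


# Toy scene: the chair is deliberately placed inside the bed's volume.
scene = [
    Box("bed", (0.0, 0.0, 0.0), (2.0, 1.6, 0.5)),
    Box("nightstand", (2.1, 0.0, 0.0), (2.6, 0.5, 0.5)),
    Box("lamp", (2.2, 0.1, 0.5), (2.4, 0.3, 0.9)),
    Box("chair", (1.5, 1.0, 0.0), (2.0, 1.5, 0.9)),
]

bed_count_ok = count_satisfied(scene, "bed", 1)
collisions = colliding_pairs(scene)
```

On this toy scene the count check for one bed passes, while the collision check flags the bed and chair pair. Bounding boxes are a coarse stand-in for object geometry; a mesh-level test would be stricter about what counts as a collision.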

Paper Structure

This paper contains 44 sections, 1 equation, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Explicit vs. implicit requirements. Explicit requirements are communicated explicitly by the user in the text description, while implicit requirements are assumed but not necessarily stated.
  • Figure 2: Overview. Given a generated scene and its corresponding annotated properties, SceneEval first matches object instances in the scene to the annotated categories. It then evaluates the scene on a comprehensive set of fidelity and plausibility metrics.
  • Figure 3: Example entry of medium difficulty in SceneEval-500. The scene description describes a basement room, a rarely-seen type in existing datasets. The annotation includes the expected scene properties, such as number of objects, specified in the text.
  • Figure 4: Examples of scenes generated using text descriptions in SceneEval-500 and the corresponding evaluation results using SceneEval. Our dataset has scene descriptions with annotations of three difficulty levels: easy, medium, and hard. SceneEval provides a comprehensive evaluation of the generated scenes on fidelity and plausibility.
  • Figure 5: Word cloud showing the most frequent object categories in SceneEval-500.
  • ...and 3 more figures