LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Gyeom Hwangbo, Hyungjoo Chae, Minseok Kang, Hyeonjong Ju, Soohyun Oh, Jinyoung Yeo

TL;DR

LEGO-Eval tackles the mismatch between fine-grained textual instructions and 3D scene synthesis by introducing a tool-augmented evaluation framework that grounds scene components through a diverse toolset. It is paired with LEGO-Bench, a benchmark of 130 real-world-like instructions containing 1,250 constraints to rigorously test alignment. Empirical results show that LEGO-Eval agrees with human judgments substantially better than existing evaluation approaches, as measured by F1 and Cohen’s $\kappa$ (e.g., a 0.41 F1 improvement over VLM-as-a-judge). They also reveal that current text-guided synthesis methods struggle to fully satisfy complex instructions: success rates reach at most 10% for scenes that satisfy every constraint. The framework demonstrates potential as an automated, interpretable evaluation backbone for advancing grounded, realistic 3D scene generation and embodied agent training.

Abstract

Despite recent progress in using Large Language Models (LLMs) for automatically generating 3D scenes, generated scenes often lack realistic spatial layouts and object attributes found in real-world environments. As this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by more detailed, fine-grained instructions that reflect real-world environments becomes crucial. Without such realistic scenes, training embodied agents in unrealistic environments can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Thus, verifying the alignment between the fine-grained instruction and the generated scene is essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess such alignment. This shortcoming arises primarily from their shallow understanding of 3D scenes, which often leads to improperly grounded scene components. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessments. We also present LEGO-Bench, a benchmark of detailed instructions that specify complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 F1 score in assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods. Across all evaluated approaches, success rates reached at most 10% in generating scenes that fully align with fine-grained instructions.
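To make the reported metrics concrete, the following is a minimal sketch of how agreement between an automatic evaluator and human annotators could be scored with F1 and Cohen's kappa, and how a "fully aligned" success rate could be computed. The function names, the boolean per-constraint verdict format, and the toy data are assumptions for illustration only; they are not taken from the paper's implementation.

```python
from typing import Sequence


def f1_score(human: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1 of the evaluator's 'satisfied' verdicts against human labels."""
    tp = sum(h and p for h, p in zip(human, predicted))
    fp = sum((not h) and p for h, p in zip(human, predicted))
    fn = sum(h and (not p) for h, p in zip(human, predicted))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def cohen_kappa(human: Sequence[bool], predicted: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters: (p_o - p_e) / (1 - p_e)."""
    n = len(human)
    if n == 0:
        raise ValueError("need at least one labeled constraint")
    observed = sum(h == p for h, p in zip(human, predicted)) / n
    p_h = sum(human) / n       # fraction labeled True by humans
    p_p = sum(predicted) / n   # fraction labeled True by the evaluator
    expected = p_h * p_p + (1 - p_h) * (1 - p_p)
    if expected == 1.0:
        # Degenerate case: both raters give the same constant label.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def holistic_success_rate(per_scene: Sequence[Sequence[bool]]) -> float:
    """Fraction of scenes in which every constraint is satisfied."""
    if not per_scene:
        return 0.0
    return sum(all(verdicts) for verdicts in per_scene) / len(per_scene)


# Toy data (hypothetical): six constraint verdicts from humans and an evaluator.
human = [True, True, False, False, True, False]
predicted = [True, False, False, False, True, True]
print(round(f1_score(human, predicted), 3))
print(round(cohen_kappa(human, predicted), 3))
print(round(holistic_success_rate([[True, True], [True, False], [True]]), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is commonly reported alongside F1 when comparing an automatic judge to human labels. The holistic rate counts a scene as a success only if none of its constraints fail, which matches the abstract's notion of scenes that "fully align" with an instruction.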

Paper Structure

This paper contains 42 sections, 1 equation, 42 figures, 7 tables.

Figures (42)

  • Figure 1: LEGO-Eval performs multi-hop grounding using tool-retrieved multimodal information (left), whereas VLMs fail to ground pencils in the scene (right).
  • Figure 2: Overview of LEGO-Eval. LEGO-Eval plans tool execution using diverse tools, and selects arguments before executing each tool. Constraints are evaluated using the collected outputs.
  • Figure 3: Diverse tools included in our tool set.
  • Figure 4: Statistics of LEGO-Bench.
  • Figure 5: Distribution of tool types executed by LEGO-Eval during evaluation.
  • ...and 37 more figures