From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts

Yuto Asano, Naruya Kondo, Tatsuki Fushimi, Yoichi Ochiai

TL;DR

This work tackles the challenge of context-aware 3D object placement by introducing a VAC-guided, open-world pipeline that iteratively refines layouts through a Vision-Language Model. It defines four escalating context levels that incorporate physical constraints, affordances, social norms, and cultural rules into a unified optimization objective. The system combines a visual assistive cue framework with recursive GPT-based reasoning (GenerateGPT, WorkerGPT, JudgeGPT) to achieve natural placements without extensive pre-training, demonstrating strengths in rotation and distance control while revealing limits at high-context levels. Overall, the approach offers a practical path toward fully automated, culture-aware 3D scene composition and highlights areas for future speedups and interpretability improvements.

Abstract

3D layout tasks have traditionally concentrated on geometric constraints, but many practical applications demand richer contextual understanding that spans social interactions, cultural traditions, and usage conventions. Existing methods often rely on rule-based heuristics or narrowly trained learning models, which makes them difficult to generalize and prone to orientation errors that break realism. To address these challenges, we define four escalating context levels, ranging from straightforward physical placement to complex cultural requirements such as religious customs and advanced social norms. We then propose a Vision-Language Model-based pipeline that inserts minimal visual cues for orientation guidance and employs iterative feedback to pinpoint, diagnose, and correct unnatural placements automatically. Each adjustment is revisited through the system's verification process until it achieves a coherent result, thereby eliminating the need for extensive user oversight or manual parameter tuning. Our experiments across these four context levels reveal marked improvements in rotation accuracy, distance control, and overall layout plausibility compared with a native VLM. By reducing the dependence on pre-programmed constraints or prohibitively large training sets, our method enables fully automated scene composition for both everyday scenarios and specialized cultural tasks, moving toward a universally adaptable framework for 3D arrangement.

Paper Structure

This paper contains 21 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: There are four context levels in arrangement tasks. Level 1 optimizes object parameters considering physical constraints. Level 2 adds affordance adjustments, such as orienting a chair toward a desk. Level 3 goes beyond affordances to social norms, for example placing a knife to the right of a plate with the blade facing inward. Level 4 includes religious and cultural constraints understood only by certain groups.
  • Figure 2: Our system architecture. Thick arrows indicate the execution sequence, while thin arrows indicate the supplementary information attached to each GPT request. The judge information includes the related object information. Although JudgeGPT is shown being called twice in this figure, it continues to run as long as inconsistencies in the scene arrangement are detected. JudgeGPT virtually updates the object's parameters to minimize the sum of three losses ($E_{collision\_total}+E_{distance}+E_{affordance}$). As the context level increases, additional error terms appear, requiring further optimization.
  • Figure 3: VAC candidates delivered as images. The front marker (a) renders only the front view, while the other version (b) includes three axes. (c) applies a front shader, and (d) shows the wireframe. (e) depicts clearance circles and indices for both the table and chair.
  • Figure 4: 12 placement tasks with different context levels. As the level increases, social awareness and specialized knowledge become necessary. The number of objects to be placed ranges from one to four.
  • Figure 5: Major failures of this system: (a) placing the pillow higher than the mattress; (b) Hina dolls floating in midair; (c) keeper and goal reversed; (d) knife and fork swapped left to right; (e) Komainu statues swapped left to right.
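
The Figure 2 caption describes JudgeGPT as minimizing $E_{collision\_total}+E_{distance}+E_{affordance}$. The paper's actual definitions of these terms are not given in this section, so the following is only a minimal sketch under stated assumptions: objects live in 2D, collisions are approximated by overlapping clearance circles, the distance term is a squared deviation from a target spacing, and the affordance term penalizes a chair that does not face a desk. All names (`Obj`, `total_energy`, `refine_yaw`) are hypothetical, and the grid search over yaw merely stands in for the VLM's virtual parameter update.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Obj:
    name: str
    x: float
    y: float
    yaw: float     # heading of the object's front, in radians
    radius: float  # clearance-circle radius (assumed simplification)


def e_collision_total(objs):
    """Sum of squared clearance-circle overlaps over all object pairs."""
    total = 0.0
    for i in range(len(objs)):
        for j in range(i + 1, len(objs)):
            a, b = objs[i], objs[j]
            gap = math.hypot(a.x - b.x, a.y - b.y) - (a.radius + b.radius)
            total += max(0.0, -gap) ** 2
    return total


def e_distance(a, b, target):
    """Squared deviation of the centre-to-centre distance from a target."""
    return (math.hypot(a.x - b.x, a.y - b.y) - target) ** 2


def e_affordance(a, b):
    """Zero when a's front points at b, 2 when it points directly away."""
    desired = math.atan2(b.y - a.y, b.x - a.x)
    return 1.0 - math.cos(a.yaw - desired)


def total_energy(chair, desk, target):
    """Assumed form of E_collision_total + E_distance + E_affordance."""
    return (
        e_collision_total([chair, desk])
        + e_distance(chair, desk, target)
        + e_affordance(chair, desk)
    )


def refine_yaw(chair, desk, target, steps=8):
    """Try evenly spaced yaw candidates and keep the lowest-energy one."""
    candidates = [replace(chair, yaw=k * 2 * math.pi / steps) for k in range(steps)]
    return min(candidates, key=lambda c: total_energy(c, desk, target))


desk = Obj("desk", 0.0, 0.0, 0.0, 0.5)
chair = Obj("chair", 1.2, 0.0, 0.0, 0.3)  # starts facing away from the desk
fixed = refine_yaw(chair, desk, target=1.2)
print(total_energy(chair, desk, 1.2), total_energy(fixed, desk, 1.2))
```

In this toy setup the chair that faces away from the desk is penalized only by the affordance term, and the refined yaw turns it toward the desk. Higher context levels would, per the caption, add further error terms to the same sum.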