From Geometry to Culture: An Iterative VLM Layout Framework for Placing Objects in Complex 3D Scene Contexts
Yuto Asano, Naruya Kondo, Tatsuki Fushimi, Yoichi Ochiai
TL;DR
This work tackles context-aware 3D object placement with an open-world pipeline that is guided by visual assistive cues (VACs) and iteratively refines layouts using a Vision-Language Model (VLM). It defines four escalating context levels that incorporate physical constraints, affordances, social norms, and cultural rules into a unified optimization objective. The system combines the VAC framework with recursive GPT-based reasoning (GenerateGPT, WorkerGPT, JudgeGPT) to achieve natural placements without extensive pre-training. It is strongest at rotation and distance control and shows limits at the higher context levels. Overall, the approach offers a practical path toward fully automated, culture-aware 3D scene composition and highlights speed and interpretability as areas for future improvement.
Abstract
3D layout tasks have traditionally concentrated on geometric constraints, but many practical applications demand richer contextual understanding that spans social interactions, cultural traditions, and usage conventions. Existing methods often rely on rule-based heuristics or narrowly trained learning models, which makes them difficult to generalize and prone to orientation errors that break realism. To address these challenges, we define four escalating context levels, ranging from straightforward physical placement to complex cultural requirements such as religious customs and advanced social norms. We then propose a Vision-Language Model (VLM)-based pipeline that inserts minimal visual cues for orientation guidance and employs iterative feedback to pinpoint, diagnose, and correct unnatural placements automatically. Each adjustment is re-checked by the system's verification process until the layout is coherent, which removes the need for extensive user oversight or manual parameter tuning. Our experiments across these four context levels show marked improvements in rotation accuracy, distance control, and overall layout plausibility compared with a native VLM. By reducing the dependence on pre-programmed constraints and prohibitively large training sets, our method enables fully automated scene composition for both everyday scenarios and specialized cultural tasks, moving toward a universally adaptable framework for 3D arrangement.
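The iterative feedback described above (propose a placement, verify it, correct it, and re-verify) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names Placement, Verdict, refine_layout, and the toy propose/judge/revise functions are hypothetical, and the VLM calls are replaced by simple deterministic stand-ins so the loop can run on its own. The target yaw, the 5-degree tolerance, and the half-step correction are likewise assumptions chosen only for illustration.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Placement:
    """Hypothetical object pose: position plus rotation about the vertical axis."""
    x: float
    y: float
    z: float
    yaw_deg: float


@dataclass(frozen=True)
class Verdict:
    """Hypothetical verifier output: pass/fail plus a suggested yaw correction."""
    ok: bool
    yaw_correction_deg: float = 0.0


def refine_layout(
    propose: Callable[[], Placement],
    judge: Callable[[Placement], Verdict],
    revise: Callable[[Placement, Verdict], Placement],
    max_iters: int = 10,
) -> Tuple[Placement, List[Tuple[Placement, Verdict]]]:
    """Propose once, then alternate judging and revising until accepted."""
    placement = propose()
    history: List[Tuple[Placement, Verdict]] = []
    for _ in range(max_iters):
        verdict = judge(placement)
        history.append((placement, verdict))
        if verdict.ok:
            return placement, history
        placement = revise(placement, verdict)
    # Budget exhausted: the last revision has not been verified.
    return placement, history


# --- Toy stand-ins for the model calls (assumptions, not real VLM output) ---
TARGET_YAW_DEG = 90.0   # e.g. a chair that should face a table
TOLERANCE_DEG = 5.0


def toy_propose() -> Placement:
    return Placement(x=1.0, y=0.0, z=2.0, yaw_deg=0.0)


def toy_judge(p: Placement) -> Verdict:
    # Signed angular error wrapped into [-180, 180).
    error = (TARGET_YAW_DEG - p.yaw_deg + 180.0) % 360.0 - 180.0
    return Verdict(ok=abs(error) <= TOLERANCE_DEG, yaw_correction_deg=error)


def toy_revise(p: Placement, v: Verdict) -> Placement:
    # Apply only half the suggested correction to mimic an imperfect reviser.
    return replace(p, yaw_deg=p.yaw_deg + 0.5 * v.yaw_correction_deg)


final, history = refine_layout(toy_propose, toy_judge, toy_revise)
print(f"accepted yaw: {final.yaw_deg} after {len(history)} checks")
```

Keeping the proposer, verifier, and reviser as separate callables mirrors the separation of roles named in the TL;DR, and the iteration cap guards against a loop that never converges. In a real system each callable would wrap a model query on a rendered view of the scene.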
