Imagining the Unseen: Generative Location Modeling for Object Placement
Jooyeol Yun, Davide Abati, Mohamed Omran, Jaegul Choo, Amirhossein Habibian, Auke Wiggers
TL;DR
This work tackles the problem of finding plausible placements for non-existing objects in a scene by proposing a generative location model. The model conditions on the input image and object class and autoregressively predicts bounding box coordinates, which handles both the one-to-many nature of plausible placements and the sparsity of labeled data. It leverages negative labels through Direct Preference Optimization to refine predictions, and demonstrates superior placement accuracy on the OPA dataset as well as improved realism in downstream object insertion when paired with an inpainting model. The results indicate that explicit location modeling enhances both the quality and coherence of inserted objects, suggesting broad utility for automated content creation, data synthesis, and planning in robotics and VR. The approach could also be extended in the future to richer spatial representations such as depth or 3D layouts.
Abstract
Location modeling, or determining where non-existing objects could feasibly appear in a scene, has the potential to benefit numerous computer vision tasks, from automatic object insertion to scene creation in virtual reality. Yet, this capability remains largely unexplored to date. In this paper, we develop a generative location model that, given an object class and an image, learns to predict plausible bounding boxes for such an object. Our approach first tokenizes the image and target object class, then decodes bounding box coordinates through an autoregressive transformer. This formulation effectively addresses two core challenges in location modeling: the inherent one-to-many nature of plausible locations, and the sparsity of existing location modeling datasets, where fewer than 1% of valid placements are labeled. Furthermore, we incorporate Direct Preference Optimization to leverage negative labels, refining the spatial predictions. Empirical evaluations reveal that our generative location model achieves superior placement accuracy on the OPA dataset compared to discriminative baselines and image composition approaches. We further test our model in the context of object insertion, where it proposes locations for an off-the-shelf inpainting model to render objects. In this respect, our proposal exhibits improved visual coherence relative to state-of-the-art instruction-tuned editing methods, demonstrating a high-performing location model's utility in a downstream application.
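The two mechanisms named in the abstract, autoregressive decoding of box coordinates and preference optimization over positive and negative labels, can be illustrated with a minimal sketch. It is not the authors' implementation. The bin count `NUM_BINS`, the coordinate order (x_min, y_min, x_max, y_max), the helper names, and the callable `next_token_logits` (a stand-in for a transformer that has already consumed the image and class tokens) are all assumptions made for illustration. The loss is the standard DPO objective for one preferred/rejected pair; pairing a positively labeled box with a negatively labeled box for the same image and class is likewise an assumption.

```python
import math
import random

NUM_BINS = 100  # hypothetical number of discrete bins per coordinate


def quantize_box(box, num_bins=NUM_BINS):
    """Map normalized (x_min, y_min, x_max, y_max) in [0, 1] to integer tokens."""
    return [min(int(v * num_bins), num_bins - 1) for v in box]


def dequantize_box(tokens, num_bins=NUM_BINS):
    """Map integer tokens back to bin-center coordinates in [0, 1]."""
    return [(t + 0.5) / num_bins for t in tokens]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_box(next_token_logits, rng, num_coords=4):
    """Sample coordinate tokens one at a time, each conditioned on the prefix.

    next_token_logits(prefix) is a hypothetical stand-in for the model and
    returns one logit per bin. Because tokens are sampled rather than
    regressed, repeated calls can return different plausible boxes.
    Returns the sampled tokens and their total log-probability.
    """
    tokens, log_prob = [], 0.0
    for _ in range(num_coords):
        probs = softmax(next_token_logits(tokens))
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        log_prob += math.log(probs[token])
        tokens.append(token)
    return tokens, log_prob


def dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg, beta=0.1):
    """Standard DPO loss, -log(sigmoid(margin)), for one box pair.

    logp_* are sequence log-probabilities under the model being trained;
    ref_logp_* are the same quantities under a frozen reference model.
    """
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    # Stable evaluation of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    rng = random.Random(0)
    uniform_model = lambda prefix: [0.0] * NUM_BINS  # toy model, no image input
    tokens, log_prob = sample_box(uniform_model, rng)
    print(dequantize_box(tokens), log_prob)
```

Sampling a sequence of discrete tokens lets the model represent a multimodal distribution over boxes, which a single regressed box cannot, and the same sequence log-probabilities feed directly into the preference loss.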
