GraPLUS: Graph-based Placement Using Semantics for Image Composition
Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei
TL;DR
GraPLUS tackles plausible object placement with a semantic-first approach, replacing pixel-centric reasoning with scene graphs and GPT-2 embeddings. The framework combines a Graph Transformer Network (GTN) with edge-aware attention, explicit spatial information, and a cross-modal attention mechanism that conditions placement on the scene context; the model is trained with adversarial objectives. On the OPA dataset, GraPLUS achieves 92.1% placement accuracy and an FID of 28.83, and human evaluators preferred it in 52.1% of cases. Ablation studies validate the contributions of the scene graphs, the GPT-2 embeddings, and the GTN architecture. This semantics-centric method improves placement plausibility and spatial precision, transfers across domains, and shows potential for scalable, context-aware image composition in applications such as AR and data augmentation.
Abstract
We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach combines a graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework uses GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling a nuanced understanding of object relationships and placement patterns. On the OPA dataset, GraPLUS achieves a placement accuracy of 92.1% and an FID score of 28.83, outperforming state-of-the-art methods by 8.1% in accuracy while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 19 participants, our method was preferred in 52.1% of cases, significantly outperforming previous approaches. The framework's key innovations are: (i) pre-trained scene graph models that transfer knowledge from other domains, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multi-objective training strategy that incorporates semantic consistency constraints.
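To make the cross-modal attention idea in (iii) concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a single-head scaled dot-product attention in which the embedding of the object category to be placed acts as the query and the scene-graph node features act as keys and values. The function name `cross_attention`, the dimensions, and the random projection matrices are all hypothetical, chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, node_feats, w_q, w_k, w_v):
    """Single-head cross-attention (illustrative assumption, not the paper's code).

    query:      (d_model,)    embedding of the object category to place
    node_feats: (n, d_model)  features of the n scene-graph nodes
    w_q, w_k, w_v: (d_model, d_attn) projection matrices
    """
    q = query @ w_q                         # (d_attn,)
    k = node_feats @ w_k                    # (n, d_attn)
    v = node_feats @ w_v                    # (n, d_attn)
    scores = k @ q / np.sqrt(q.shape[-1])   # (n,) similarity of each node to the query
    weights = softmax(scores)               # attention over scene nodes, sums to 1
    context = weights @ v                   # (d_attn,) scene summary conditioned on the category
    return context, weights

# Hypothetical sizes and random inputs, for demonstration only.
rng = np.random.default_rng(0)
d_model, d_attn, n_nodes = 16, 8, 5
category_emb = rng.normal(size=d_model)
scene_nodes = rng.normal(size=(n_nodes, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_attn)) * 0.1 for _ in range(3))

context, weights = cross_attention(category_emb, scene_nodes, w_q, w_k, w_v)
```

In this sketch, `weights` indicates how strongly each scene node contributes to the placement context for the given category, and `context` is the vector a downstream placement predictor could consume. A full model would presumably learn the projections and use multiple heads.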
