GraPLUS: Graph-based Placement Using Semantics for Image Composition

Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei

TL;DR

GraPLUS addresses plausible object placement by replacing pixel-centric reasoning with a semantic-first approach built on scene graphs and GPT-2 embeddings. The framework combines a Graph Transformer Network (GTN) with edge-aware attention, explicit spatial information, and a cross-modal attention mechanism that conditions placement on the scene context; it is trained with adversarial objectives. On the OPA dataset, GraPLUS achieves 92.1% placement accuracy and an FID of 28.83, and human evaluators preferred it in 52.1% of cases. Ablation studies support the contributions of the scene graphs, the GPT-2 embeddings, and the GTN architecture. The semantic-centric method improves placement plausibility and spatial precision, offers transferability across domains, and has potential for scalable, context-aware image composition in applications such as AR and data augmentation.

Abstract

We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling nuanced understanding of object relationships and placement patterns. GraPLUS achieves placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.1% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 19 participants, our method was preferred in 52.1% of cases, significantly outperforming previous approaches. The framework's key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multiobjective training strategy incorporating semantic consistency constraints.
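The cross-modal attention idea in item (iii) can be illustrated with a minimal sketch, assuming plain scaled dot-product attention in which a foreground category embedding acts as the query and scene-graph node features act as keys and values. The function names, the toy 4-dimensional vectors, and the omission of learned query/key/value projections are all assumptions made for illustration; the abstract does not specify the actual layer design.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(query: Vector, nodes: List[Vector]) -> Tuple[List[float], Vector]:
    """Attend from one query vector over a set of node feature vectors.

    Hypothetical simplification: the nodes serve as both keys and values,
    and no learned projection matrices are applied.
    """
    dim = len(query)
    # Scaled dot-product score between the query and every node.
    scores = [dot(query, node) / math.sqrt(dim) for node in nodes]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of node features.
    context = [
        sum(w * node[i] for w, node in zip(weights, nodes))
        for i in range(dim)
    ]
    return weights, context


# Toy inputs (made up): one category embedding and three scene nodes.
category_embedding = [1.0, 0.0, 0.0, 0.0]
scene_nodes = [
    [2.0, 0.0, 0.0, 0.0],  # node aligned with the category embedding
    [0.0, 2.0, 0.0, 0.0],  # unrelated node
    [0.0, 0.0, 2.0, 0.0],  # unrelated node
]

weights, context = cross_attention(category_embedding, scene_nodes)
print("attention weights:", [round(w, 3) for w in weights])
print("context vector:", [round(c, 3) for c in context])
```

In this toy setup the node most similar to the category embedding receives the largest attention weight, so the resulting context vector is dominated by that node. A placement head could then consume such a context vector, although how GraPLUS does this is not described in the abstract.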

Paper Structure

This paper contains 22 sections, 26 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Overview of the GraPLUS framework architecture. The generator (left, green) processes background images through scene graph generation and enhancement, while the discriminator (center) evaluates composition quality. The embedding block (right, purple) provides semantic enrichment using GPT-2. The legend (bottom right) explains the visual elements used in the diagram.
  • Figure 2: Overview of Scene Graph Generation (SGG). Given an input image ($I_{bg}$), the SGG module generates (a) a scene graph capturing object relationships and their interactions (nodes represent objects and edges represent relationships), and (b) corresponding bounding box detections for each object.
  • Figure 3: Comparison of models for object placement. Each column corresponds to a model, and each row represents a specific index. The placed foreground objects are highlighted by the red outline, indicating the predicted placement boundaries. Best viewed with zoom-in.