Griffin: Generative Reference and Layout Guided Image Composition

Aryan Mikaeili, Amirhossein Alimohammadi, Negar Hassanpour, Ali Mahdavi-Amiri, Andrea Tagliasacchi

TL;DR

Griffin tackles the challenge of composing images from multiple sources with precise layout guidance in a training-free setting. It introduces a two-stage pipeline: initialize structure via masked IP-Adapter and refine with layout-controlled attention-sharing, complemented by dynamic layout updates to adapt masks during generation. The method achieves strong identity preservation and accurate layout adherence for both object-level and part-level composition, outperforming training-based baselines and layout-only methods in both qualitative and quantitative evaluations, including user studies. This work offers a practical, data-efficient tool for compositional image synthesis with potential extensions to style transfer and 3D texture applications.

Abstract

Text-to-image models have achieved a level of realism that enables the generation of highly convincing images. However, text-based control can be a limiting factor when more explicit guidance is needed. Defining both the content and its precise placement within an image is crucial for achieving finer control. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires a single image per reference, and provides explicit and simple control for object and part-level composition. We demonstrate its effectiveness across various image composition tasks.

Paper Structure

This paper contains 29 sections, 12 equations, 14 figures, 3 tables.

Figures (14)

  • Figure 1: With Griffin, we can generate an image by defining both the content to be incorporated and its placement within the final composition. By conditioning on different images and specifying layouts using either bounding boxes or pixel masks, our method enables a wide range of compositional variations. The base prompt is "A portrait of a woman ...".
  • Figure 2: Control in image generation -- Naïve attention-sharing lacks explicit layout control. (a) and (b) are generated using the text: “A dog sitting in the yard. The dog is on the left side of the image.” but fail to reliably position the subject. In (c), masked IP-Adapter is used, but it struggles with identity preservation. (d) shows our method, which successfully maintains the subject's identity and adheres to the layout and text prompt.
  • Figure 3: Without proper initialization, attention sharing has little effect: the image is generated as if attention sharing were absent, leading to artifacts in (a) and (b). Using masked IP-Adapter for initialization allows attention sharing to effectively transfer appearance from the sources to each subject in the target (c).
  • Figure 4: Pipeline -- (a) We use IP-Adapter to initialize the structure of the target image based on the layouts. We then apply our layout-controlled attention sharing. (b) Our attention-sharing mechanism allows our generator to only attend to sub-portions of the input images, avoiding identity leakage. (c) We apply masked IP-Adapter with a high scale at the initialization stage to rapidly align the image with the input layout. At timestep $T_{LBA}$, attention-sharing begins, and the IP-Adapter scale is reduced. The displayed images are the denoised predictions at timesteps 1,000, $T_{LBA}$ and 0.
  • Figure 5: Dynamic layout update -- We extract DIFT [tang2023emergent] and DINO [caron2021dino] features from the source and target images, then compute pixel correspondences following [zhang2023tale]. We discard pixels without correspondence and group the remaining pixels by their corresponding source image. Farthest sampling is used to obtain subject-specific group points, which are then fed into SAM [kirillov2023segany] to generate updated masks.
  • ...and 9 more figures
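
The Figure 5 caption mentions farthest sampling as the step that turns each subject's matched pixels into a few point prompts for SAM. The following is a minimal sketch of generic greedy farthest point sampling, not the authors' implementation: the function name `farthest_point_sampling`, the `start` parameter, the choice of Euclidean distance, and the example coordinates are all assumptions made for illustration.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Greedily pick up to k points, each one farthest from those already picked.

    `points` is a list of (x, y) tuples; `start` is the index of the first pick.
    Hypothetical helper for illustration only.
    """
    if not points or k <= 0:
        return []
    k = min(k, len(points))
    selected = [start]
    # min_dist[i] holds the distance from point i to its nearest selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        # The next pick is the point that is farthest from the current selection.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        # Refresh nearest-selected distances with the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return [points[i] for i in selected]


# Made-up pixel coordinates standing in for one subject's matched pixels.
pixels = [(0, 0), (1, 0), (0, 1), (10, 0), (10, 10), (5, 5)]
prompts = farthest_point_sampling(pixels, 3)
print(prompts)  # [(0, 0), (10, 10), (10, 0)]
```

Spreading the selected points across the matched region, rather than sampling them at random, tends to cover the whole subject with only a few prompts, which is presumably why a farthest-sampling step is used before segmentation.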