Griffin: Generative Reference and Layout Guided Image Composition
Aryan Mikaeili, Amirhossein Alimohammadi, Negar Hassanpour, Ali Mahdavi-Amiri, Andrea Tagliasacchi
TL;DR
Griffin tackles the challenge of composing images from multiple reference images with precise layout guidance in a training-free setting. It introduces a two-stage pipeline that first initializes structure with a masked IP-Adapter and then refines the result with layout-controlled attention sharing, complemented by dynamic layout updates that adapt the masks during generation. The method achieves strong identity preservation and accurate layout adherence for both object-level and part-level composition, outperforming training-based baselines and layout-only methods in qualitative and quantitative evaluations, including user studies. This work offers a practical, data-efficient tool for compositional image synthesis, with potential extensions to style transfer and 3D texture applications.
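To make the idea of layout-controlled attention sharing more concrete, here is a minimal NumPy sketch of one plausible reading: target tokens attend to their own keys and to reference-image keys, but the reference keys are visible only to target tokens that fall inside the layout region assigned to that reference. This is an illustration under stated assumptions, not the authors' implementation. The function name `masked_shared_attention`, the single-head formulation, the single reference, and the boolean per-token `layout_mask` are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; -inf logits become exactly zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref, layout_mask):
    """Single-head attention over target + reference tokens (hypothetical sketch).

    q_tgt, k_tgt, v_tgt: (N, d) arrays for the image being generated.
    k_ref, v_ref:        (M, d) arrays for one reference image.
    layout_mask:         (N,) bool array, True where a target token lies
                         inside the layout region assigned to the reference.
    Returns the attended values (N, d) and the attention weights (N, N + M).
    """
    n, d = q_tgt.shape
    keys = np.concatenate([k_tgt, k_ref], axis=0)    # (N + M, d)
    values = np.concatenate([v_tgt, v_ref], axis=0)  # (N + M, d)
    logits = q_tgt @ keys.T / np.sqrt(d)             # (N, N + M)

    # Target keys are always visible; reference keys are visible only to
    # queries inside the layout region.
    allowed = np.ones(logits.shape, dtype=bool)
    allowed[:, n:] = layout_mask[:, None]
    logits = np.where(allowed, logits, -np.inf)

    weights = softmax(logits, axis=-1)
    return weights @ values, weights


# Usage: 6 target tokens, 4 reference tokens, first 2 target tokens in-region.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 8)) for _ in range(3))
k_ref, v_ref = (rng.standard_normal((4, 8)) for _ in range(2))
mask = np.array([True, True, False, False, False, False])
out, attn = masked_shared_attention(q, k, v, k_ref, v_ref, mask)
```

In this sketch, tokens outside the region receive exactly zero weight on reference keys, so reference content cannot leak outside its assigned layout. A multi-reference version would presumably use one mask per reference, and a dynamic layout update would recompute those masks between denoising steps; both are left out here because the summary above does not specify them.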
Abstract
Text-to-image models have achieved a level of realism that enables the generation of highly convincing images. However, text-based control can be a limiting factor when more explicit guidance is needed. Defining both the content and its precise placement within an image is crucial for achieving finer control. In this work, we address the challenge of multi-image layout control, where the desired content is specified through images rather than text, and the model is guided on where to place each element. Our approach is training-free, requires a single image per reference, and provides explicit and simple control for object and part-level composition. We demonstrate its effectiveness across various image composition tasks.
