IP-Composer: Semantic Composition of Visual Concepts
Sara Dorfman, Dana Cohen-Bar, Rinon Gal, Daniel Cohen-Or
TL;DR
IP-Composer is a training-free framework for composing visual concepts from multiple image references. For each concept it constructs a concept-specific CLIP subspace, then replaces the reference embedding's component in that subspace with the projection of the corresponding concept image's embedding. The projection matrices $P_{c_k}$ are built from LLM-generated variation prompts via SVD, and the embeddings are combined as $e_{comp}=e_{ref}-\sum_k P_{c_k} e_{ref}+\sum_k P_{c_k} e_{c_k}$, which IP-Adapter then uses to synthesize new images. Compared to training-based baselines, this approach offers broader, more flexible concept control without their data and training costs, while delivering competitive qualitative and quantitative performance and enabling multi-concept compositions. The method can integrate detailed visual cues from images with high-level textual prompts, which makes it practical for creative content generation. Its limitations stem from entanglement in the CLIP and diffusion feature spaces, which causes leakage across certain concept pairs.
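The composition rule above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: random vectors stand in for CLIP embeddings, the function names `concept_projection` and `compose` and the `rank` parameter are hypothetical, and details such as whether the prompt embeddings are mean-centered before the SVD are left out because the summary does not specify them.

```python
import numpy as np


def concept_projection(text_embeddings: np.ndarray, rank: int) -> np.ndarray:
    """Projection matrix onto a concept subspace (illustrative sketch).

    text_embeddings: (n_prompts, dim) array, one row per variation prompt.
    rank: number of leading singular directions to keep (an assumed knob).
    """
    # Right singular vectors give an orthonormal basis of the row space.
    _, _, vt = np.linalg.svd(text_embeddings, full_matrices=False)
    basis = vt[:rank]          # (rank, dim), orthonormal rows
    return basis.T @ basis     # (dim, dim) orthogonal projector


def compose(e_ref, concept_embeddings, projections):
    """e_comp = e_ref - sum_k P_k e_ref + sum_k P_k e_ck."""
    e_comp = e_ref.copy()
    for e_c, P in zip(concept_embeddings, projections):
        e_comp = e_comp - P @ e_ref + P @ e_c
    return e_comp


# Toy usage: random data in place of real CLIP text/image embeddings.
rng = np.random.default_rng(0)
dim = 16
prompts_a = rng.normal(size=(8, dim))   # stand-in for embedded variation prompts
P_a = concept_projection(prompts_a, rank=3)

e_ref = rng.normal(size=dim)            # stand-in for the reference image embedding
e_a = rng.normal(size=dim)              # stand-in for the concept image embedding
e_comp = compose(e_ref, [e_a], [P_a])
```

With a single concept, the composite embedding matches the concept image inside the subspace (`P_a @ e_comp` equals `P_a @ e_a`) and matches the reference image outside it. With several concepts, that clean separation holds only when the subspaces do not overlap, which is one way to read the leakage limitation noted above.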
Abstract
Content creators often draw inspiration from multiple visual sources, combining distinct elements to craft new compositions. Modern computational approaches now aim to emulate this fundamental creative process. Although recent diffusion models excel at text-guided compositional synthesis, text as a medium often lacks precise control over visual details. Image-based composition approaches can capture more nuanced features, but existing methods are typically limited in the range of concepts they can capture, and require expensive training procedures or specialized data. We present IP-Composer, a novel training-free approach for compositional image generation that leverages multiple image references simultaneously, while using natural language to describe the concept to be extracted from each image. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. Through comprehensive evaluation, we show that our approach enables more precise control over a larger range of visual concept compositions.
