IP-Composer: Semantic Composition of Visual Concepts

Sara Dorfman, Dana Cohen-Bar, Rinon Gal, Daniel Cohen-Or

TL;DR

IP-Composer is a training-free framework for composing visual concepts from multiple image references. It constructs concept-specific CLIP subspaces and replaces the reference embedding’s projection onto each subspace with the corresponding concept image’s projection. Projection matrices are built from LLM-generated variation prompts via SVD, and the embeddings are combined as $e_{comp}=e_{ref}-\sum_{k} P_{c_k} e_{ref}+\sum_{k} P_{c_k} e_{c_k}$ before being passed to IP-Adapter to synthesize new images. Compared to training-based baselines, this approach offers broader, more flexible concept control with reduced data and training requirements, while delivering competitive qualitative and quantitative performance and enabling multi-concept compositions. The method is strong at integrating detailed visual cues from images with high-level textual prompts, which opens practical avenues for creative content generation. Its limitations relate to CLIP/diffusion entanglement and leakage across certain concept pairs.
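The composition rule above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: random vectors stand in for CLIP text and image embeddings, the function names `concept_projection` and `compose` are hypothetical, and the subspace rank (`rank=4`) and the absence of any centering or normalization are assumptions made only for illustration.

```python
import numpy as np


def concept_projection(text_embeddings: np.ndarray, rank: int) -> np.ndarray:
    """Return a (d, d) projector onto a rank-`rank` concept subspace.

    text_embeddings: (n_prompts, d) array, one row per variation prompt.
    The rank is a free parameter here (an assumption of this sketch).
    """
    # The right singular vectors form an orthonormal basis of the row space.
    _, _, vt = np.linalg.svd(text_embeddings, full_matrices=False)
    basis = vt[:rank]          # (rank, d), rows are orthonormal
    return basis.T @ basis     # orthogonal projection matrix P_c


def compose(e_ref: np.ndarray, concepts) -> np.ndarray:
    """Apply e_comp = e_ref - sum_k P_k e_ref + sum_k P_k e_k.

    concepts: iterable of (P_k, e_k) pairs, one per concept image.
    """
    e_comp = e_ref.copy()
    for proj, e_concept in concepts:
        e_comp = e_comp - proj @ e_ref + proj @ e_concept
    return e_comp


# Stand-in data: random vectors instead of real CLIP embeddings.
rng = np.random.default_rng(0)
d = 16
text_embeddings = rng.normal(size=(20, d))   # 20 hypothetical variation prompts
P = concept_projection(text_embeddings, rank=4)
e_ref = rng.normal(size=d)                   # reference image embedding
e_concept = rng.normal(size=d)               # concept image embedding
e_comp = compose(e_ref, [(P, e_concept)])
```

With a single concept, the component of `e_comp` inside the subspace equals that of the concept image, and the component outside it equals that of the reference image. With several concepts whose subspaces overlap, these identities no longer hold exactly.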

Abstract

Content creators often draw inspiration from multiple visual sources, combining distinct elements to craft new compositions. Modern computational approaches now aim to emulate this fundamental creative process. Although recent diffusion models excel at text-guided compositional synthesis, text as a medium often lacks precise control over visual details. Image-based composition approaches can capture more nuanced features, but existing methods are typically limited in the range of concepts they can capture, and require expensive training procedures or specialized data. We present IP-Composer, a novel training-free approach for compositional image generation that leverages multiple image references simultaneously, while using natural language to describe the concept to be extracted from each image. Our method builds on IP-Adapter, which synthesizes novel images conditioned on an input image's CLIP embedding. We extend this approach to multiple visual inputs by crafting composite embeddings, stitched from the projections of multiple input images onto concept-specific CLIP-subspaces identified through text. Through comprehensive evaluation, we show that our approach enables more precise control over a larger range of visual concept compositions.

Paper Structure

This paper contains 18 sections, 5 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Method overview for a 2-image composition scenario. (top) We use an LLM to generate texts describing possible variations of a concept we want to extract from the concept-image. We encode the responses using CLIP, and find the embedding-subspace that they span. (bottom) We generate a composite CLIP-embedding by replacing the projection of the reference image on this embedding-subspace with the matching projection of the concept-image. The composite embedding can be used by an off-the-shelf IP-Adapter to generate images combining the reference and the visual concept. The same approach can be applied with additional concept images.
  • Figure 2: Examples of visual concept compositions enabled by IP-Composer. Our method can seamlessly tackle texture-based tasks like colorization and pattern changes, but also convey layouts or modify object-level content.
  • Figure 3: Results demonstrating our method’s ability to integrate text prompts alongside image embeddings, leveraging IP-Adapter’s built-in support for text conditioning.
  • Figure 4: Quantitative results mirror our qualitative observations, showing that IP-Composer can successfully compete with and even outperform existing training-based methods.
  • Figure 5: User study results. Our approach is commonly preferred by users, even when compared with training-based methods.
  • ...and 7 more figures