Piece it Together: Part-Based Concepting with IP-Priors

Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or

TL;DR

PiT tackles ideation from sparse visual cues by learning a part-conditioned prior in the expressive $\mathcal{IP}^+$ space. It introduces IP-Prior, a lightweight diffusion transformer that assembles the given parts and samples plausible completions within a learned domain, and IP-LoRA, which restores text conditioning without sacrificing reconstruction. By operating in a compact, richly descriptive embedding space rather than CLIP space, PiT achieves better reconstructions, supports semantic edits in $\mathcal{IP}^+$, and allows style customization through LoRA. The approach is demonstrated across multiple domains, showing coherent composition from partial inputs, diverse outputs, and tunable text adherence, which makes it a practical tool for visual ideation and design workflows. Overall, PiT provides a flexible, domain-adaptive pipeline that combines part-based conditioning with learned priors to bridge concept ideation and high-quality image rendering.

Abstract

Advanced generative models excel at synthesizing images but often rely on text-based conditioning. Visual designers, however, often work beyond language, directly drawing inspiration from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, serving as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.
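To make the phrase "flow-matching model" concrete, the following is a minimal sketch of how a flow-matching training target and a sampler are commonly set up. The abstract does not state the exact objective, so the linear interpolation path between noise and data (the rectified-flow form) is an assumption here, and every name (`flow_matching_pair`, `euler_sample`, `velocity_fn`) is hypothetical. Plain Python lists stand in for embeddings, and no network or part conditioning is included.

```python
import random


def flow_matching_pair(clean, rng):
    """Build one training example for a linear-path flow-matching objective.

    ASSUMPTION: linear path x_t = (1 - t) * noise + t * clean, which is a
    common choice but is not confirmed by the source.
    `clean` is a list of floats standing in for a target embedding.
    Returns (x_t, t, target_velocity).
    """
    noise = [rng.gauss(0.0, 1.0) for _ in clean]
    t = rng.random()  # time sampled uniformly in [0, 1)
    x_t = [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]
    # Along a linear path the velocity is constant: clean - noise.
    target_velocity = [c - n for n, c in zip(noise, clean)]
    return x_t, t, target_velocity


def mse(pred, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps.

    `velocity_fn` is a hypothetical stand-in for a trained model; in a
    part-conditioned setting it would also receive the part embeddings.
    """
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

During training, a model's prediction at `(x_t, t)` would be compared against `target_velocity` with `mse`; at inference, `euler_sample` starts from noise and follows the predicted velocity to produce a clean embedding.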

Paper Structure

This paper contains 28 sections, 2 equations, 20 figures, 1 table.

Figures (20)

  • Figure 1: Using a dedicated prior for the target domain, our method, Piece it Together (PiT), completes missing information by seamlessly integrating the given elements into a coherent composition while adding the missing pieces needed for the complete concept to reside in the prior domain.
  • Figure 2: Semantic Manipulation in CLIP Space vs. $\mathcal{IP}^+$ Space. We encode the input image (left) into two different embedding spaces, modify its latent representation by traversing each space, and render the edited image using SDXL [podell2023sdxl]. As shown, CLIP struggles to both reconstruct the concept and follow the desired edit, whereas in $\mathcal{IP}^+$ space, the rendered images are faithful both to the concept and the desired edit across the entire range.
  • Figure 3: Piece-it-Together Overview. Given an input image, we extract its semantic components (e.g., using SAM [kirillov2023segment]) and encode each image patch into the $\mathcal{IP}^+$ space using frozen IP-Adapter+ (IP-A+) blocks (shown in yellow). The resulting set of compact image embeddings is then passed through our IP-Prior model (green), which also receives a noised image embedding representing our desired complete concept. The IP-Prior model outputs a cleaned image embedding that captures the intended concept, which is subsequently used to generate the final concept image using SDXL [podell2023sdxl] (blue). At inference time, users can provide a varying number of object-part images to generate a new concept that aligns with the learned distribution.
  • Figure 4: Generated Data Samples. We present sample images generated using FLUX-Schnell [blackforest2024flux], which are used to train our IP-Prior model for the "creatures" domain.
  • Figure 5: Recovering the Text Adherence via IP-LoRA. IP-Adapter+ enables rendering generated concepts via SDXL [podell2023sdxl] but often struggles with text adherence. To address this, we fine-tune a LoRA adapter over paired examples, where the conditioning image has a clean background and the target image places the object in a scene described using a text prompt. This lightweight training (using just 50 prompts) effectively restores text control while maintaining visual fidelity.
  • ...and 15 more figures
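
The Figure 2 caption describes editing an image by traversing an embedding space. A minimal sketch of one common way to do this is shown below: estimate a direction as the difference between the mean embeddings of two example groups, then shift an embedding along it at several scales. The caption does not say how the direction is obtained, so the mean-difference construction is an assumption, and all names (`edit_direction`, `traverse`) are hypothetical. Short lists of floats stand in for real embeddings.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def edit_direction(with_attribute, without_attribute):
    """Direction pointing from embeddings lacking an attribute to those having it.

    ASSUMPTION: a mean-difference direction; the source does not specify
    how edit directions are computed.
    """
    positive = mean_vector(with_attribute)
    negative = mean_vector(without_attribute)
    return [p - q for p, q in zip(positive, negative)]


def traverse(embedding, direction, scales):
    """Return one shifted copy of `embedding` per scale: e + s * d."""
    return [
        [e + s * d for e, d in zip(embedding, direction)]
        for s in scales
    ]


# Toy usage with 2-dimensional stand-in embeddings.
direction = edit_direction(
    with_attribute=[[1.0, 2.0], [3.0, 4.0]],
    without_attribute=[[0.0, 0.0], [0.0, 0.0]],
)
edited = traverse([1.0, 1.0], direction, scales=[0.0, 0.5, 1.0])
```

Each vector in `edited` would then be passed to the renderer in place of the original embedding; a scale of 0.0 leaves the input unchanged, which gives a direct check of reconstruction quality before any edit is applied.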