Piece it Together: Part-Based Concepting with IP-Priors
Elad Richardson, Kfir Goldberg, Yuval Alaluf, Daniel Cohen-Or
TL;DR
PiT tackles ideation from sparse visual cues by learning a part-conditioned prior in the expressive $\mathcal{IP}^+$ space. It introduces IP-Prior, a lightweight diffusion transformer that assembles the given parts and samples plausible completions within a learned domain, and IP-LoRA, which restores text conditioning without sacrificing reconstruction quality. Because it operates in a compact, richly descriptive embedding space rather than in CLIP space, PiT achieves better reconstructions and supports semantic edits in $\mathcal{IP}^+$, and LoRA additionally allows style customization. The approach is demonstrated across multiple domains, producing coherent compositions from partial inputs, diverse outputs, and tunable text adherence, which makes it a practical tool for visual ideation and design workflows. Overall, PiT provides a flexible, domain-adaptive pipeline that combines part-based conditioning with learned priors to bridge concept ideation and high-quality image rendering.
Abstract
Advanced generative models excel at synthesizing images but typically rely on text-based conditioning. Visual designers, however, often work beyond language, drawing inspiration directly from existing visual elements. In many cases, these elements represent only fragments of a potential concept, such as a uniquely structured wing or a specific hairstyle, and serve as inspiration for the artist to explore how they can come together creatively into a coherent whole. Recognizing this need, we introduce a generative framework that seamlessly integrates a partial set of user-provided visual components into a coherent composition while simultaneously sampling the missing parts needed to generate a plausible and complete concept. Our approach builds on a strong and underexplored representation space, extracted from IP-Adapter+, on which we train IP-Prior, a lightweight flow-matching model that synthesizes coherent compositions based on domain-specific priors, enabling diverse and context-aware generations. Additionally, we present a LoRA-based fine-tuning strategy that significantly improves prompt adherence in IP-Adapter+ for a given task, addressing its common trade-off between reconstruction quality and prompt adherence.
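To make the phrase "flow-matching model" concrete, the following is a minimal sketch of a generic conditional flow-matching objective and an Euler sampler, written in plain Python on toy vectors. It is not the IP-Prior implementation: the straight-line path between noise and data, the function names (`interpolate`, `target_velocity`, `flow_matching_loss`, `euler_sample`), and the `predict_velocity(x_t, t, cond)` interface are all assumptions chosen for illustration, with `cond` standing in for the embeddings of the user-provided parts.

```python
import random

def interpolate(x0, x1, t):
    """Point at time t on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Velocity of the straight path; it does not depend on t."""
    return [b - a for a, b in zip(x0, x1)]

def flow_matching_loss(predict_velocity, x1, cond, rng):
    """Single-sample estimate of the flow-matching regression loss.

    predict_velocity is a hypothetical model interface: (x_t, t, cond) -> velocity.
    x1 is a target embedding; cond is whatever the model is conditioned on.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
    t = rng.random()                          # uniform time in [0, 1)
    x_t = interpolate(x0, x1, t)
    v_pred = predict_velocity(x_t, t, cond)
    v_true = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_true)) / len(x1)

def euler_sample(predict_velocity, cond, dim, steps, rng):
    """Integrate the learned velocity field from noise (t=0) toward data (t=1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = predict_velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x
```

In this toy setting, training would minimize `flow_matching_loss` over many target embeddings and conditioning sets, and sampling would call `euler_sample` with the conditioning built from the available parts; different noise draws then yield different completions for the same parts.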
