ComFusion: Personalized Subject Generation in Multiple Specific Scenes From Single Image

Yan Hong, Jianfu Zhang

TL;DR

ComFusion tackles the challenge of personalized text-to-image generation from limited subject data by preserving both subject identity and scene semantics. It introduces a two-stream framework: a composite stream with a class-scene prior loss to retain pretrained priors, and a fusion stream with visual-textual matching losses to align instance visuals with scene prompts, using coarse intermediate denoising results to guide fusion. The method jointly optimizes an instance finetune loss, a class-scene prior loss, and cross-modal fusion losses, achieving state-of-the-art instance and scene fidelity in one-shot and few-shot personalization. Extensive experiments on a combined TI and DreamBooth dataset show quantitative and qualitative improvements over baselines, and ablations confirm the effectiveness of each component and the balance between fidelity and diversity.
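The joint optimization described above can be illustrated with a minimal sketch. It assumes the three groups of losses are combined as a weighted sum; the summary does not state the exact form or the weight values, so `LossWeights`, `joint_objective`, and the default weights below are hypothetical names and numbers chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: the source does not give their values.
    prior: float = 1.0   # weight on the class-scene prior loss
    fusion: float = 1.0  # weight on the visual-textual fusion losses


def joint_objective(instance_loss, prior_loss,
                    fusion_instance_loss, fusion_scene_loss,
                    weights=LossWeights()):
    """Combine the loss terms as a weighted sum (an assumed form)."""
    # Composite stream: instance finetune loss plus class-scene prior loss.
    composite = instance_loss + weights.prior * prior_loss
    # Fusion stream: instance-matching and scene-matching terms.
    fusion = weights.fusion * (fusion_instance_loss + fusion_scene_loss)
    return composite + fusion


if __name__ == "__main__":
    # Scalar stand-ins for per-batch loss values.
    print(joint_objective(0.5, 0.25, 0.125, 0.125))
```

In this sketch, setting `weights.fusion` to zero leaves only the composite stream, which is one way to picture the kind of component ablation the summary mentions.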

Abstract

Recent advancements in personalizing text-to-image (T2I) diffusion models have shown the capability to generate images based on personalized visual concepts using a limited number of user-provided examples. However, these models often struggle to maintain high visual fidelity, particularly when manipulating scenes as defined by textual inputs. To address this, we introduce ComFusion, a novel approach that leverages pretrained models to generate compositions of a few user-provided subject images and predefined text scenes, effectively fusing visual subject instances with text-specified scenes and producing high-fidelity instances within diverse scenes. ComFusion integrates a class-scene prior preservation regularization, which composites subject-class and scene-specific knowledge from pretrained models to enhance generation fidelity. Additionally, ComFusion uses coarse generated images, ensuring they align effectively with both the instance image and the scene texts. Consequently, ComFusion maintains a delicate balance between capturing the essence of the subject and maintaining scene fidelity. Extensive evaluations of ComFusion against various baselines in T2I personalization have demonstrated its qualitative and quantitative superiority.

Paper Structure (14 sections, 6 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: In contrast to existing methods [ruiz2023dreambooth, avrahami2023break], which often struggle to preserve instance fidelity and scene fidelity simultaneously, ComFusion composites the instance image with textual prompts and fuses the visual details of the subject instance with the textual variations of the scenes, yielding plausible, personalized images with rich diversity.
  • Figure 2: Illustration of the ComFusion finetuning framework. We show an example in the one-shot personalized generation setting; note that ComFusion can also be applied to few-shot settings. ComFusion consists of a composite stream (highlighted with green and orange arrows, details in \ref{sec:finetune} and \ref{sec:composite}) and a fusion stream (highlighted with blue arrows, details in \ref{sec:fuse}).
  • Figure 3: The coarse generated results $\tilde{\bm{x}}^{IS}_k$ under supervision of the visual-textual fusion losses $\{\mathcal{L}_F^I,\mathcal{L}_F^S\}$, with denoising steps $\tau$ sampled from $\{1,3,5\}$. The instance in the coarse generated images is similar to the instance image $\bm{x}^I$, while remaining consistent with the specific scene in the prior images $\bm{x}^{CS}_k$.
  • Figure 4: Images generated by DreamBooth [ruiz2023dreambooth], TI [gal2022image], CD [kumari2023multi], XTI [voynov2023p+], ELITE [wei2023elite], Break-A-Scene [avrahami2023break], and our proposed ComFusion in multiple specific scenes from a single instance image.
  • Figure 5: Visual ablation results of ComFusion.