Customized Generation Reimagined: Fidelity and Editability Harmonized

Jian Jin, Yang Shen, Zhenyong Fu, Jian Yang

TL;DR

This work tackles the inherent fidelity-editability trade-off in customized generation for pre-trained diffusion models. It introduces two innovations: Image-specific Context Optimization (ICO) to produce a more effective fine-tuned model through learnable image-context prompts, and Divide-Conquer-Integrate (DCI), an inference-time framework that decouples concept fidelity from prompt alignment via two collaborative branches and a Dual-Branch Integration Module (DBIM) with per-layer aggregation. The combination enables high-fidelity rendering of a new concept while maintaining strong agreement with varied prompts, even for concepts with weak generative priors. Practically, ICO+DCI provides a flexible, controllable approach to customized generation, capable of adapting to diverse query prompts and novel contexts, with potential extensions to layout and depth-conditioned generation in future work.

Abstract

Customized generation aims to incorporate a novel concept into a pre-trained text-to-image model, enabling new generations of the concept in novel contexts guided by textual prompts. However, customized generation suffers from an inherent trade-off between concept fidelity and editability, i.e., between precisely modeling the concept and faithfully adhering to the prompts. Previous methods reluctantly seek a compromise and struggle to achieve both high concept fidelity and ideal prompt alignment simultaneously. In this paper, we propose a Divide, Conquer, then Integrate (DCI) framework, which performs a surgical adjustment in the early stage of denoising to liberate the fine-tuned model from the fidelity-editability trade-off at inference. The two conflicting components in the trade-off are decoupled and individually conquered by two collaborative branches, which are then selectively integrated to preserve high concept fidelity while achieving faithful prompt adherence. To obtain a better fine-tuned model, we introduce an Image-specific Context Optimization (ICO) strategy for model customization. ICO replaces manual prompt templates with learnable image-specific contexts, providing an adaptive and precise fine-tuning direction to promote the overall performance. Extensive experiments demonstrate the effectiveness of our method in reconciling the fidelity-editability trade-off.

Paper Structure

This paper contains 13 sections, 5 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Concept fidelity and editability trade-off in customized generation. (a) is an illustration of the customization process, and (b) shows the corresponding variations in concept fidelity and editability during the customization process. (a): The distribution is portrayed in two dimensions. In the "pre-trained domain", deeper colors indicate higher editability. (b): Concept fidelity and editability outline a trade-off curve (red line) in the customization process. Our method (blue dotted line) is designed to free the fine-tuned model from this trade-off, and provide a powerful inference-time adjustment mechanism to achieve satisfactory generations for all query prompts using a single fine-tuned model.
  • Figure 2: Visualization of query examples across different customization processes. We show the entire fine-tuning process (i.e., until overfitting). The model fine-tuned using manual prompts is prone to premature overfitting, and the generated images suffer from concept distortion. ICO alleviates premature overfitting, resulting in a longer process, and achieves a better trade-off between concept fidelity and editability. We highlight the best generation with a yellow border for each query prompt. The optimal trade-off point varies across query prompts, and the trade-off is hard to reconcile for customized generations with low model priors (e.g., sample (c)). DCI can effectively address these issues, generating satisfactory images for each prompt using one fine-tuned model. The ICO prompt schemes used in "ICO" and "ICO+DCI" differ, as detailed in the section labeled sec:exp_comp.
  • Figure 3: Overall framework of the prompt generating process in fine-tuning.
  • Figure 4: An overview of the proposed "Divide, Conquer, then Integrate" (DCI) framework. DCI divides the customized generation task into two collaborative branches, namely the concept branch and the auxiliary branch. The concept branch takes the original query prompt as its conditioning and is responsible for generating the concept with high fidelity. The auxiliary branch takes an auxiliary prompt as its conditioning and contributes concept-irrelevant content. During the latent denoising steps, the contents generated by the two branches are selectively integrated using a Dual-Branch Integration Module (DBIM) in each cross-attention layer.
  • Figure 5: Visual comparison of ICO with baselines. We illustrate the effectiveness of ICO by comparing it to three baselines. Reference images are shown on the left. First row: customizing various scenes for the target concept. Our method can better reconstruct the target concept while maintaining high text alignment. Second row: adding new objects. Third row: customizing artistic styles for the target concept. While other methods suffer from concept distortion, our method can generate the target concept with higher concept fidelity.
  • ...and 7 more figures
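
The Figure 4 caption describes two branches whose cross-attention outputs are selectively integrated in each layer. The snippet below is only an illustrative sketch of that idea and not the paper's DBIM: it assumes, hypothetically, that the positions to take from the concept branch can be picked by thresholding the concept token's min-max normalized attention map. The names `concept_mask`, `integrate_branches`, and the `threshold` parameter are invented for this example.

```python
from typing import List

Vector = List[float]


def concept_mask(concept_attn: List[float], threshold: float = 0.5) -> List[float]:
    """Return a binary mask (one value per spatial position).

    Assumption for this sketch: a position belongs to the concept when the
    concept token's min-max normalized attention is at least `threshold`.
    """
    lo, hi = min(concept_attn), max(concept_attn)
    span = hi - lo
    if span == 0.0:
        # Flat attention gives no localization signal; use the auxiliary branch everywhere.
        return [0.0 for _ in concept_attn]
    return [1.0 if (a - lo) / span >= threshold else 0.0 for a in concept_attn]


def integrate_branches(
    concept_out: List[Vector], aux_out: List[Vector], mask: List[float]
) -> List[Vector]:
    """Blend per-position features: concept branch where mask is 1, auxiliary elsewhere."""
    if not (len(concept_out) == len(aux_out) == len(mask)):
        raise ValueError("branch outputs and mask must cover the same positions")
    merged = []
    for c_vec, a_vec, m in zip(concept_out, aux_out, mask):
        merged.append([m * c + (1.0 - m) * a for c, a in zip(c_vec, a_vec)])
    return merged


if __name__ == "__main__":
    # Four spatial positions with two feature channels each (toy values).
    attn = [0.1, 0.9, 0.8, 0.1]
    concept_features = [[1.0, 1.0] for _ in attn]
    auxiliary_features = [[0.0, 2.0] for _ in attn]
    mask = concept_mask(attn)
    print(mask)
    print(integrate_branches(concept_features, auxiliary_features, mask))
```

In this toy setup the two middle positions come from the concept branch and the outer ones from the auxiliary branch. Applying such a blend inside every cross-attention layer, rather than once on the final image, is what the caption's per-layer integration refers to; the actual selection rule used by DBIM is not given in this summary.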