Negative-Guided Subject Fidelity Optimization for Zero-Shot Subject-Driven Generation

Chaehun Shin, Jooyoung Choi, Johan Barthelemy, Jungbeom Lee, Sungroh Yoon

TL;DR

This paper tackles the challenge of preserving fine-grained subject details in zero-shot subject-driven text-to-image generation. It introduces Subject Fidelity Optimization (SFO), a comparison-based fine-tuning framework that uses synthetic negative targets generated via Condition-Degradation Negative Sampling (CDNS) and emphasizes intermediate diffusion timesteps to sharpen subject fidelity while maintaining text alignment. The method is grounded in a Bradley-Terry-style objective that compares positives against negatives relative to a reference, and it includes a theoretical rationale linking the objective to mutual information through a flow-matching surrogate. Empirical results on DreamBench show that SFO with CDNS outperforms strong baselines in subject fidelity and achieves competitive text alignment. Ablations validate the contributions of CDNS, timestep reweighting, and the degradation strategies used to produce informative negative targets.

Abstract

We present Subject Fidelity Optimization (SFO), a novel comparative learning framework for zero-shot subject-driven generation that enhances subject fidelity. Existing supervised fine-tuning methods, which rely only on positive targets and use the diffusion loss as in the pre-training stage, often fail to capture fine-grained subject details. To address this, SFO introduces additional synthetic negative targets and explicitly guides the model to favor positives over negatives through pairwise comparison. For negative targets, we propose Condition-Degradation Negative Sampling (CDNS), which automatically produces synthetic negatives tailored for subject-driven generation by introducing controlled degradations that emphasize subject fidelity and text alignment without expensive human annotations. Moreover, we reweight the diffusion timesteps to focus fine-tuning on intermediate steps where subject details emerge. Extensive experiments demonstrate that SFO with CDNS significantly outperforms recent strong baselines in terms of both subject fidelity and text alignment on a subject-driven generation benchmark. Project page: https://subjectfidelityoptimization.github.io/
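
The pairwise comparison and timestep reweighting described above can be illustrated with a small sketch. This is not the paper's objective: it assumes a DPO-like Bradley-Terry form in which the fine-tuned model's denoising error is compared with that of a frozen reference model on both the positive and the negative target. The function names, the `beta` scale, and the bell-shaped weight with its `center` and `width` parameters are all hypothetical choices made for illustration.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_fidelity_loss(err_pos: float, err_neg: float,
                           ref_err_pos: float, ref_err_neg: float,
                           beta: float = 1.0, weight: float = 1.0) -> float:
    """Bradley-Terry-style comparison loss (assumed form, not the paper's).

    err_pos / err_neg: denoising error of the model being fine-tuned on the
    positive and negative target. ref_err_*: the same errors from a frozen
    reference model. `beta` and `weight` are hypothetical hyperparameters.
    """
    # How much the fine-tuned model improves over the reference on each target.
    pos_gain = ref_err_pos - err_pos
    neg_gain = ref_err_neg - err_neg
    # The loss falls when the improvement on the positive exceeds
    # the improvement on the negative.
    margin = beta * (pos_gain - neg_gain)
    return weight * -log_sigmoid(margin)


def mid_timestep_weight(t: float, center: float = 0.5, width: float = 0.2) -> float:
    """Hypothetical bell-shaped weight over t in [0, 1] that peaks at
    intermediate timesteps; the paper's actual weighting is not given here."""
    return math.exp(-0.5 * ((t - center) / width) ** 2)


# Example: the model fits the positive better and the negative worse
# than the reference does, at an intermediate timestep.
loss = pairwise_fidelity_loss(0.1, 0.5, 0.3, 0.3,
                              beta=2.0, weight=mid_timestep_weight(0.5))
```

When the fine-tuned model and the reference have identical errors, the margin is zero and the loss equals log 2. The loss decreases only when the model favors the positive target more than the negative one, which is the behavior the pairwise comparison is meant to encourage.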

Paper Structure

This paper contains 34 sections, 11 equations, 11 figures, 9 tables.

Figures (11)

  • Figure 1: Our Subject Fidelity Optimization (SFO) framework improves subject fidelity in zero-shot subject-driven text-to-image generation by introducing negative targets and explicitly guiding the model regarding which aspects are desirable and which are not. The supervised fine-tuning results are obtained using our base model, OminiControl, and the results shown above on both sides are generated with the same seed and prompt (prompts are included in the Appendix).
  • Figure 2: Overall framework. (a) Previous supervised fine-tuning methods use a triplet dataset to generate the target image conditioned on a given reference image and target text prompt. (b) Starting from a supervised fine-tuned model, we synthesize negative target data with CDNS, extending the triplet dataset into a quadruplet dataset containing informative negatives. (c) We further fine-tune the supervised fine-tuned model on the quadruplet dataset with SFO so that it distinguishes positive from negative target data given the same condition.
  • Figure 3: Dataset construction comparisons. (a) We present examples of synthesized negative targets from each naïve Self-Play method and our CDNS with given conditions. (b) While the negative targets of naïve Self-Play have high similarity with positive samples, our negative targets from CDNS demonstrate diverse pairwise gaps between targets and enable more effective optimization.
  • Figure 4: Qualitative comparisons. Our method captures fine-grained details better than the baselines, such as the font on the abdomen of the bear plushie (row 1) or the limbs of the monster toy (row 3). Results in each row are generated with the same random seed for fairness.
  • Figure A: Toy experiment results. (a) Qualitative results: The top and bottom rows show images generated using the same random seed with each model, allowing for direct visual comparison. (b) Quantitative results: This plot shows the change in the proportion of images classified as red cars out of 100 generated samples, using a car color classifier.
  • ...and 6 more figures