SpotDiff: Spotting and Disentangling Interference in Feature Space for Subject-Preserving Image Generation

Yongzhi Li, Saining Zhang, Yibing Chen, Boying Li, Yanxin Zhang, Xiaoyu Du

TL;DR

The paper tackles personalized image generation by addressing the entanglement between subject identity and nuisance factors like pose and background. It introduces SpotDiff, a learning-based framework that spots nuisance factors with pose and background experts and removes interference via orthogonality constraints in feature space, aided by a CLIP-based encoder and an alignment module. To enable principled training, the authors construct SpotDiff10k, a 10k-image dataset with controlled pose consistency and background variation. Experiments show robust subject preservation and controllable editing with competitive quality using only 10k training samples, demonstrating effective disentanglement and efficiency.

Abstract

Personalized image generation aims to faithfully preserve a reference subject's identity while adapting to diverse text prompts. Existing optimization-based methods ensure high fidelity but are computationally expensive, while learning-based approaches offer efficiency at the cost of entangled representations influenced by nuisance factors. We introduce SpotDiff, a novel learning-based method that extracts subject-specific features by spotting and disentangling interference. Leveraging a pre-trained CLIP image encoder and specialized expert networks for pose and background, SpotDiff isolates subject identity through orthogonality constraints in the feature space. To enable principled training, we introduce SpotDiff10k, a curated dataset with consistent pose and background variations. Experiments demonstrate that SpotDiff achieves more robust subject preservation and controllable editing than prior methods, while attaining competitive performance with only 10k training samples.
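The abstract does not spell out the exact form of the orthogonality constraints, so the following is only a minimal sketch of one common way such a constraint can be written: penalize the squared cosine similarity between an identity feature and a nuisance (pose or background) feature, and optionally project the nuisance direction out. The function names `cosine_orthogonality_penalty` and `remove_component`, the batch-of-row-vectors layout, and the choice of squared cosine are all assumptions made for illustration, not the paper's loss.

```python
import numpy as np


def cosine_orthogonality_penalty(a: np.ndarray, b: np.ndarray, eps: float = 1e-8) -> float:
    """Mean squared cosine similarity between paired rows of `a` and `b`.

    Hypothetical penalty: close to 0 when each pair of rows is orthogonal,
    close to 1 when each pair is parallel. Shapes: (batch, dim).
    """
    a_unit = a / (np.linalg.norm(a, axis=1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=1, keepdims=True) + eps)
    cos = np.sum(a_unit * b_unit, axis=1)
    return float(np.mean(cos ** 2))


def remove_component(feature: np.ndarray, nuisance: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Project each row of `feature` onto the orthogonal complement of the
    paired row of `nuisance` (illustrative helper, not from the paper)."""
    scale = np.sum(feature * nuisance, axis=1, keepdims=True) / (
        np.sum(nuisance * nuisance, axis=1, keepdims=True) + eps
    )
    return feature - scale * nuisance


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image_feature = rng.standard_normal((4, 16))  # stand-in for encoder features
    pose_feature = rng.standard_normal((4, 16))   # stand-in for a pose expert output
    cleaned = remove_component(image_feature, pose_feature)
    print("penalty before:", cosine_orthogonality_penalty(image_feature, pose_feature))
    print("penalty after: ", cosine_orthogonality_penalty(cleaned, pose_feature))
```

In a trainable setting the penalty would be added to the generation loss so that the experts and the identity branch are pushed toward non-overlapping directions; the explicit projection is shown only to make the geometric meaning of "orthogonal in feature space" concrete.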

Paper Structure

This paper contains 15 sections, 8 equations, 8 figures, and 2 tables.

Figures (8)

  • Figure 1: Interference of background and pose. When the background changes, baseline methods exhibit noticeable variations in the generated subject appearance, while our method maintains a consistent subject identity. Furthermore, when editing the subject, baseline methods tend to replicate the pose from the input images, thereby failing to perform accurate subject editing. In contrast, our approach effectively decouples pose from identity, enabling precise subject manipulation.
  • Figure 2: Pipeline overview. During inference, the image encoder maps the input image to a latent feature space, and two expert networks then process these features, using orthogonality constraints to disentangle the relevant semantic features. The resulting feature vectors are aligned, concatenated with a pre-defined text prompt, and encoded by the text encoder to form the multi-modal condition that guides generation. During training, each input image is passed through a noise scheduler to obtain $Z_t$, and at each training step $t$ the model's predicted noise is compared with the added noise to compute the gradient. The orange blocks are trainable, while the blue blocks are frozen.
  • Figure 3: SpotDiff framework. The image encoder extracts the feature vectors, which are then passed through expert networks to disentangle pose and background components.
  • Figure 4: Example from the SpotDiff10k dataset: Given an original image, we generate new subjects with varied appearances while maintaining the same pose. Subsequently, the background of each subject is altered. Each original image is expanded into approximately 100 unique variations.
  • Figure 5: Recontextualization comparison of our method with ELITE, BLIP-Diffusion, and MoMA. Each row shows the outputs for the same input image and prompt.
  • ...and 3 more figures
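
The training step described in the Figure 2 caption (noise scheduler producing $Z_t$, then a comparison on the predicted noise) can be illustrated with the standard DDPM-style formulation. This is a sketch under assumptions: the closed-form noising rule, the mean-squared-error objective, and the names `add_noise`, `noise_prediction_loss`, and `alpha_bar_t` are the usual textbook choices and are not confirmed by this summary as the paper's exact setup.

```python
import numpy as np


def add_noise(z0: np.ndarray, noise: np.ndarray, alpha_bar_t: float) -> np.ndarray:
    """Standard forward-diffusion step (assumed, not taken from the paper):
    z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise


def noise_prediction_loss(predicted_noise: np.ndarray, true_noise: np.ndarray) -> float:
    """Mean squared error between the predicted noise and the noise that was added."""
    return float(np.mean((predicted_noise - true_noise) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z0 = rng.standard_normal((2, 4))      # stand-in for a clean image latent
    noise = rng.standard_normal((2, 4))   # Gaussian noise drawn by the scheduler
    z_t = add_noise(z0, noise, alpha_bar_t=0.5)

    # A real denoiser would take z_t, the step t, and the multi-modal condition;
    # here an all-zero prediction stands in for its output.
    predicted = np.zeros_like(noise)
    print("loss:", noise_prediction_loss(predicted, noise))
```

Gradients of this loss would flow only into the blocks marked trainable in the figure, with the frozen blocks acting as fixed feature extractors.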