P3S-Diffusion: A Selective Subject-driven Generation Framework via Point Supervision

Junjie Hu, Shuyong Gao, Lingyi Hong, Qishan Wang, Yuzhou Zhao, Yan Wang, Wenqiang Zhang

TL;DR

P3S-Diffusion tackles selective subject-driven generation using minimal point supervision to distinguish target subjects in images with multiple similar instances. It introduces RNGR-Encoder to produce a subject-biased representation from patch-level CLIP similarities and a rough, inpainted mask, followed by Multi-layers Condition Injection that feeds subject features into a trainable U-Net copy. An Attention Consistency Loss aligns cross-attention between the trainable and frozen networks, while a timestep-based weight scheduler balances prompt fidelity and identity preservation. Empirical results on DreamBench show improved subject alignment, fidelity, and stability, with ablations highlighting the contributions of RNGR-Encoder, L_ac, and weight scheduling. The approach offers a lightweight, controllable path for personalized generation with minimal annotation cost, though it may struggle with highly detailed subjects or non-salient imagery.
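
To make the role of the timestep-based weight scheduler concrete, here is a minimal sketch of how a per-timestep control weight could scale the trainable copy's contribution before it is added to the frozen U-Net's hidden state. The summary does not give the schedule, so the linear form, the endpoint values `w_start` and `w_end`, and the function names `control_weight` and `inject` are all assumptions made for illustration, not the paper's scheduler.

```python
import numpy as np

def control_weight(t, num_steps, w_start=1.0, w_end=0.3):
    """Hypothetical linear schedule: weight moves from w_start at t=0 to w_end at the last step."""
    frac = t / max(num_steps - 1, 1)
    return w_start + (w_end - w_start) * frac

def inject(frozen_hidden, copy_hidden, weight):
    """Add the trainable copy's features to the frozen branch, scaled by the control weight.

    A larger weight favours the reference subject's identity; a smaller weight
    leaves the frozen network (and therefore the text prompt) in control.
    """
    return frozen_hidden + weight * copy_hidden

# Toy usage with random features standing in for U-Net hidden states.
rng = np.random.default_rng(0)
frozen = rng.normal(size=(4, 8))
copy = rng.normal(size=(4, 8))
mixed = inject(frozen, copy, control_weight(t=500, num_steps=1000))
```

With a weight of zero the output equals the frozen branch, which is the limiting case in which only the prompt drives generation.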

Abstract

Recent research in subject-driven generation increasingly emphasizes the importance of selective subject features. Nevertheless, accurately selecting content in a given reference image remains challenging, especially when the image contains similar subjects (e.g., two different dogs). Some methods use text prompts or pixel masks to isolate specific elements. However, text prompts often fall short of describing specific content precisely, and pixel masks are often expensive to annotate. To address this, we introduce P3S-Diffusion, a novel architecture designed for context-selected subject-driven generation via point supervision. P3S-Diffusion leverages minimal-cost labels (e.g., points) to generate subject-driven images. During fine-tuning, it generates an expanded base mask from these points, obviating the need for additional segmentation models. The mask is employed for inpainting and for alignment with the subject representation. P3S-Diffusion preserves fine features of the subjects through Multi-layers Condition Injection. Training is further enhanced by the Attention Consistency Loss, and extensive experiments demonstrate excellent feature preservation and image generation capabilities.

Paper Structure

This paper contains 10 sections, 15 equations, 5 figures, and 4 tables.

Figures (5)

  • Figure 1: Selective subject comparison. Our results not only excel in selective subject generation but also show high fidelity to the reference subjects.
  • Figure 2: Overview of our method. Given point-image pairs $(p, I)$, the RNGR-Encoder calculates patch-level similarity on the hidden-state features from the CLIP image encoder. This yields a rough mask from the point information without any additional segmentation model. We then inpaint the masked image and enhance the subject representation through image-image cross-attention. Finally, the encoded features serve as the input to the trainable copy, which adds conditions to the original U-Net via Multi-layers Condition Injection.
  • Figure 3: Subject-driven generation. We use Multi-layers Condition Injection to add conditions. Specifically, we add the hidden states of both the self-attention and cross-attention layers to the original model through a zero convolution. This injects a detailed image representation and preserves the generated subject. During training, we adopt the denoising reconstruction loss $\mathcal{L}_{LDM}$ and the attention consistency loss $\mathcal{L}_{ac}$ to keep the attention on the selected subject consistent.
  • Figure 4: Qualitative results of P3S-Diffusion. Our method adapts to both single-subject generation and selective subject generation with points.
  • Figure 5: Results for different weights. Adjusting the control weights between the original U-Net and the trainable copy balances prompt consistency and identity preservation. MLP denotes a learnable control weight parameter.
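
The point-to-mask step described in the Figure 2 caption can be illustrated with a small sketch: take the patch feature under the annotated point and mark every patch whose cosine similarity to it is high. This is only an illustration under stated assumptions. The function name `rough_mask_from_point`, the fixed `threshold`, and the plain thresholding rule are hypothetical; the captions do not specify how the RNGR-Encoder expands or post-processes the mask, and random or synthetic arrays stand in for real CLIP hidden states here.

```python
import numpy as np

def rough_mask_from_point(patch_feats, point_xy, image_size, threshold=0.6):
    """Build a coarse patch-level mask from a single annotated point.

    patch_feats: (H_p, W_p, D) array of patch features (e.g. reshaped encoder hidden states).
    point_xy:    (x, y) pixel coordinates of the annotated point.
    image_size:  (width, height) of the image in pixels.
    threshold:   assumed similarity cut-off (hypothetical value).
    """
    h_p, w_p, _ = patch_feats.shape
    width, height = image_size
    # Map the pixel coordinate to the patch-grid cell that contains it.
    col = min(int(point_xy[0] * w_p / width), w_p - 1)
    row = min(int(point_xy[1] * h_p / height), h_p - 1)
    # L2-normalise so that the dot product equals cosine similarity.
    norms = np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    normed = patch_feats / (norms + 1e-8)
    sim = normed @ normed[row, col]  # shape (H_p, W_p)
    return sim >= threshold, sim

# Toy usage: the left half of the grid shares one feature, the right half another.
feats = np.zeros((4, 4, 2))
feats[:, :2] = [1.0, 0.0]
feats[:, 2:] = [0.0, 1.0]
mask, sim = rough_mask_from_point(feats, point_xy=(10, 10), image_size=(224, 224))
```

In the toy example the point falls in the left half, so only the left-half patches are selected. With real features the similarity map is noisier, which is presumably why the method treats the result as a rough mask rather than a segmentation.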