Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization

Guandong Li, Mengxia Ye

TL;DR

SpatialID decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses, and introduces a Temporal-Spatial Scheduling strategy that adjusts the spatial constraints across denoising steps to align with the diffusion generation dynamics.

Abstract

Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, which lets identity features contaminate non-facial regions (e.g., backgrounds and lighting) and degrades text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free spatially-adaptive identity modulation framework. SpatialID decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints, transitioning from Gaussian priors to attention-based masks and then to adaptive relaxation, to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), substantially reducing background contamination while maintaining robust identity preservation.

Paper Structure (18 sections, 8 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: Overall architecture of SpatialID. The reference face image is processed by the PuLID ID Encoder (ArcFace + EVA-CLIP + IDFormer) to extract identity tokens $\mathbf{Z}_{id}$. At each injection point in the FLUX DiT, the SpatialID module extracts a spatial relevance mask $\mathbf{M}_t$ from the cross-attention output, and dynamically adjusts the mask form across different denoising stages through the Temporal-Spatial Scheduler, achieving spatially-adaptive identity injection: $\mathbf{h} \leftarrow \mathbf{h} + \alpha \cdot \mathbf{M}_t \odot \text{CA}(\mathbf{Z}_{id}, \mathbf{h})$.
  • Figure 2: Qualitative comparison. We present four highly challenging generation scenarios: space astronaut, medieval knight castle, Parisian café, and Renaissance oil painting portrait. Compared to PuLID and DVI, SpatialID achieves superior background semantic fidelity and natural lighting while maintaining identity consistency.
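
The update rule in the Figure 1 caption, $\mathbf{h} \leftarrow \mathbf{h} + \alpha \cdot \mathbf{M}_t \odot \text{CA}(\mathbf{Z}_{id}, \mathbf{h})$, together with the three mask stages named in the abstract (Gaussian prior, attention-based mask, adaptive relaxation), can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names, the stage boundaries `early` and `late`, the Gaussian width `sigma`, the relaxation `floor`, and the use of the per-token norm of the cross-attention output as a relevance score are all assumptions made for illustration, and the image tokens are assumed to be already reshaped to an `(H, W, C)` grid.

```python
import numpy as np


def gaussian_prior(height, width, sigma=0.35):
    """Centered Gaussian mask with values in (0, 1].

    sigma is expressed in normalized [-1, 1] coordinates (assumed value).
    """
    ys = np.linspace(-1.0, 1.0, height)[:, None]
    xs = np.linspace(-1.0, 1.0, width)[None, :]
    return np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))


def attention_mask(ca_out):
    """Min-max normalized per-token norm of the cross-attention output.

    Using the output norm as a relevance score is an assumption; the
    source only says the mask is derived from cross-attention responses.
    """
    energy = np.linalg.norm(ca_out, axis=-1)  # shape (H, W)
    span = energy.max() - energy.min()
    return (energy - energy.min()) / (span + 1e-8)


def scheduled_mask(ca_out, progress, early=0.2, late=0.8, floor=0.3):
    """Pick the mask form from denoising progress in [0, 1] (0 = first step).

    early, late and floor are hypothetical hyperparameters.
    """
    height, width, _ = ca_out.shape
    if progress < early:
        # Early steps: no reliable attention yet, fall back to a spatial prior.
        return gaussian_prior(height, width)
    mask = attention_mask(ca_out)
    if progress < late:
        # Middle steps: restrict injection to attention-relevant tokens.
        return mask
    # Late steps: one possible relaxation, lifting the mask to a minimum value.
    return np.maximum(mask, floor)


def inject(hidden, ca_out, progress, alpha=1.0):
    """Apply h <- h + alpha * M_t * CA(Z_id, h) with a per-token mask."""
    mask = scheduled_mask(ca_out, progress)
    return hidden + alpha * mask[..., None] * ca_out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hidden = rng.normal(size=(8, 8, 4))   # stand-in for DiT hidden states
    ca_out = rng.normal(size=(8, 8, 4))   # stand-in for CA(Z_id, h)
    for progress in (0.1, 0.5, 0.9):
        updated = inject(hidden, ca_out, progress, alpha=0.8)
        print(progress, updated.shape)
```

Replacing `scheduled_mask` with a mask of all ones recovers the spatially uniform injection that the abstract identifies as the source of background contamination, which makes the sketch a convenient baseline for comparison.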