Inject Where It Matters: Training-Free Spatially-Adaptive Identity Preservation for Text-to-Image Personalization

Guandong Li; Mengxia Ye

TL;DR

SpatialID decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses, and introduces a Temporal-Spatial Scheduling strategy that adjusts spatial constraints across denoising stages to align with the diffusion generation dynamics.

Abstract

Personalized text-to-image generation aims to integrate specific identities into arbitrary contexts. However, existing tuning-free methods typically employ Spatially Uniform Visual Injection, which lets identity features contaminate non-facial regions (e.g., backgrounds and lighting) and degrades text adherence. To address this without expensive fine-tuning, we propose SpatialID, a training-free, spatially-adaptive identity modulation framework. SpatialID decouples identity injection into face-relevant and context-free regions using a Spatial Mask Extractor derived from cross-attention responses. Furthermore, we introduce a Temporal-Spatial Scheduling strategy that dynamically adjusts spatial constraints (transitioning from Gaussian priors to attention-based masks and then to adaptive relaxation) to align with the diffusion generation dynamics. Extensive experiments on IBench demonstrate that SpatialID achieves state-of-the-art performance in text adherence (CLIP-T: 0.281), visual consistency (CLIP-I: 0.827), and image quality (IQ: 0.523), substantially reducing background contamination while maintaining robust identity preservation.
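
The three-stage schedule named in the abstract (Gaussian prior, then attention-based mask, then adaptive relaxation) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the stage boundaries `early_end` and `late_start`, the Gaussian width `sigma`, the min-max normalization, and the reading of "adaptive relaxation" as raising a floor under the mask so that it loosens toward uniform injection. The function names are hypothetical.

```python
import math

def gaussian_prior(height, width, sigma=0.25):
    """Center-weighted mask in (0, 1]. sigma is a fraction of the grid size (assumed value)."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            dy = (y - cy) / (sigma * height)
            dx = (x - cx) / (sigma * width)
            row.append(math.exp(-0.5 * (dx * dx + dy * dy)))
        mask.append(row)
    return mask

def normalize(attn):
    """Min-max normalize a 2D attention response map to [0, 1]."""
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid division by zero on a constant map
    return [[(v - lo) / span for v in row] for row in attn]

def scheduled_mask(progress, attn, early_end=0.3, late_start=0.8, floor=0.5):
    """Pick the mask form for the current denoising progress.

    progress: 0.0 at the first denoising step, 1.0 at the last.
    attn: 2D list of cross-attention responses for the identity tokens.
    early_end, late_start, floor: hypothetical hyperparameters.
    """
    height, width = len(attn), len(attn[0])
    if progress < early_end:
        # Early steps: attention is still noisy, so fall back to a spatial prior.
        return gaussian_prior(height, width)
    mask = normalize(attn)
    if progress < late_start:
        # Middle steps: trust the attention-derived mask as is.
        return mask
    # Late steps: relax the constraint by raising a floor under the mask (assumed form).
    t = (progress - late_start) / max(1e-8, 1.0 - late_start)
    relax = floor * t
    return [[max(v, relax) for v in row] for row in mask]
```

Calling `scheduled_mask(step / (num_steps - 1), attn)` once per denoising step would give a mask that starts as a centered blob, follows the attention map in the middle of sampling, and loosens near the end. The actual stage boundaries and relaxation rule are defined in the paper's Temporal-Spatial Scheduling section.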

Paper Structure (18 sections, 8 equations, 2 figures, 2 tables)

Introduction
Related Work
Personalized Text-to-Image Generation
Tuning-Free Identity Preserving Generation
Spatial Control and Attention Mechanisms
Method
Overview
Spatial Mask Extractor
Temporal-Spatial Scheduling
Implementation Details
Experiments
Experimental Settings
Qualitative Comparison
Quantitative Evaluation
Ablation Study
...and 3 more sections

Figures (2)

Figure 1: Overall architecture of SpatialID. The reference face image is processed by the PuLID ID Encoder (ArcFace + EVA-CLIP + IDFormer) to extract identity tokens $\mathbf{Z}_{id}$. At each injection point in the FLUX DiT, the SpatialID module extracts a spatial relevance mask $\mathbf{M}_t$ from the cross-attention output, and dynamically adjusts the mask form across different denoising stages through the Temporal-Spatial Scheduler, achieving spatially-adaptive identity injection: $\mathbf{h} \leftarrow \mathbf{h} + \alpha \cdot \mathbf{M}_t \odot \text{CA}(\mathbf{Z}_{id}, \mathbf{h})$.
Figure 2: Qualitative comparison. We present four highly challenging generation scenarios: space astronaut, medieval knight castle, Parisian café, and Renaissance oil painting portrait. Compared to PuLID and DVI, SpatialID achieves superior background semantic fidelity and natural lighting while maintaining identity consistency.
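
The update rule in the Figure 1 caption, $\mathbf{h} \leftarrow \mathbf{h} + \alpha \cdot \mathbf{M}_t \odot \text{CA}(\mathbf{Z}_{id}, \mathbf{h})$, can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the cross-attention below is single-head and omits the learned query, key, and value projections, the mask is assumed to hold one value per image token, and the function names are hypothetical.

```python
import numpy as np

def cross_attention(h, z_id):
    """Simplified single-head cross-attention with no learned projections.

    h:    (N, d) image tokens, used as queries.
    z_id: (K, d) identity tokens, used as keys and values.
    """
    scores = h @ z_id.T / np.sqrt(h.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ z_id

def inject_identity(h, z_id, mask, alpha=1.0):
    """Apply h <- h + alpha * M_t (elementwise) CA(Z_id, h).

    mask: (N,) spatial relevance values in [0, 1], broadcast over channels.
    """
    return h + alpha * mask[:, None] * cross_attention(h, z_id)

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 8))      # 4 image tokens, 8 channels
z_id = rng.normal(size=(3, 8))   # 3 identity tokens
mask = np.array([1.0, 1.0, 0.0, 0.0])  # first two tokens are "face", the rest "context"
out = inject_identity(h, z_id, mask, alpha=0.8)
```

With a mask of all ones this reduces to spatially uniform injection, the baseline behavior the abstract criticizes. Tokens whose mask value is zero pass through unchanged, which is how background regions are kept free of identity features in this simplified picture.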
