FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation
Zebin Yao, Lei Ren, Huixing Jiang, Wei Chen, Xiaojie Wang, Ruifan Li, Fangxiang Feng
TL;DR
FreeGraftor tackles the fidelity-efficiency gap in subject-driven text-to-image generation by delivering a training-free framework that transfers subject appearance via cross-image feature grafting. It introduces Semantic-Aware Feature Grafting (SAFG) within MM-DiT diffusion models, coupled with a structure-consistent initialization based on collage inversion, to preserve geometry and details without fine-tuning. The method achieves superior subject fidelity and text alignment compared with zero-shot and other training-free baselines and scales to multi-subject scenarios, all with favorable compute requirements. The work provides a practical, plug-and-play solution with open-source code, enabling robust, high-fidelity, text-guided personalization in real-world applications.
Abstract
Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
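To make the idea of semantic matching under a position constraint more concrete, here is a deliberately simplified sketch in plain Python. It is not the authors' implementation: FreeGraftor fuses features inside the attention layers of a diffusion model, whereas this toy version simply copies the best-matching reference feature into each generated patch. The function names, the cosine-similarity criterion, the mask representation, and the 0.5 threshold are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def graft_features(ref_feats, ref_mask, gen_feats, gen_mask, threshold=0.5):
    """Hypothetical helper: graft reference features onto matched generated patches.

    ref_feats / gen_feats: lists of feature vectors, one per image patch.
    ref_mask / gen_mask:   booleans marking patches inside the subject region.
    Returns (grafted_feats, matches), where matches maps a generated patch
    index to the reference patch index it was grafted from.
    """
    ref_ids = [j for j, inside in enumerate(ref_mask) if inside]
    grafted = [list(f) for f in gen_feats]  # copy so the input is left untouched
    matches = {}
    for i, feat in enumerate(gen_feats):
        # Position constraint (assumed form): only patches inside the
        # generated subject region may receive reference features.
        if not gen_mask[i] or not ref_ids:
            continue
        # Semantic matching (assumed form): nearest reference patch by cosine similarity.
        best = max(ref_ids, key=lambda j: cosine(feat, ref_feats[j]))
        if cosine(feat, ref_feats[best]) >= threshold:
            grafted[i] = list(ref_feats[best])
            matches[i] = best
    return grafted, matches


# Toy data: three reference patches (two on the subject), four generated patches.
ref_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref_mask = [True, True, False]
gen_feats = [[0.9, 0.1], [0.1, 0.8], [-1.0, 0.0], [0.7, 0.7]]
gen_mask = [True, True, True, False]

grafted, matches = graft_features(ref_feats, ref_mask, gen_feats, gen_mask)
print(matches)
```

In this toy run, the first two generated patches find a confident match and are overwritten. The third lies inside the subject mask but has no sufficiently similar reference patch, and the fourth lies outside the mask, so both keep their original features. The threshold and the mask play the role of filtering unreliable matches and protecting the text-driven background, which is the intuition the abstract describes.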
