FreeGraftor: Training-Free Cross-Image Feature Grafting for Subject-Driven Text-to-Image Generation

Zebin Yao, Lei Ren, Huixing Jiang, Wei Chen, Xiaojie Wang, Ruifan Li, Fangxiang Feng

TL;DR

FreeGraftor tackles the fidelity-efficiency gap in subject-driven text-to-image generation by delivering a training-free framework that transfers subject appearance via cross-image feature grafting. It introduces Semantic-Aware Feature Grafting (SAFG) within MM-DiT diffusion models, coupled with a structure-consistent initialization based on collage inversion, to preserve geometry and details without fine-tuning. The method achieves superior subject fidelity and text alignment compared with zero-shot and other training-free baselines and scales to multi-subject scenarios, all with favorable compute requirements. The work provides a practical, plug‑and‑play solution with open-source code, enabling robust, high-fidelity, text-guided personalization in real-world applications.

Abstract

Subject-driven image generation aims to synthesize novel scenes that faithfully preserve subject identity from reference images while adhering to textual guidance. However, existing methods struggle with a critical trade-off between fidelity and efficiency. Tuning-based approaches rely on time-consuming and resource-intensive, subject-specific optimization, while zero-shot methods often fail to maintain adequate subject consistency. In this work, we propose FreeGraftor, a training-free framework that addresses these limitations through cross-image feature grafting. Specifically, FreeGraftor leverages semantic matching and position-constrained attention fusion to transfer visual details from reference subjects to the generated images. Additionally, our framework introduces a novel noise initialization strategy to preserve the geometry priors of reference subjects, facilitating robust feature matching. Extensive qualitative and quantitative experiments demonstrate that our method enables precise subject identity transfer while maintaining text-aligned scene synthesis. Without requiring model fine-tuning or additional training, FreeGraftor significantly outperforms existing zero-shot and training-free approaches in both subject fidelity and text alignment. Furthermore, our framework can seamlessly extend to multi-subject generation, making it practical for real-world deployment. Our code is available at https://github.com/Nihukat/FreeGraftor.
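The semantic matching step mentioned in the abstract can be illustrated with a minimal sketch. It assumes patch features are already available as plain arrays and uses cosine similarity as the matching criterion. The function name `match_patches`, the array shapes, and the choice of cosine similarity are assumptions made for illustration, not details taken from the paper or its released code.

```python
import numpy as np

def match_patches(ref_feats, gen_feats, eps=1e-8):
    """Hypothetical helper: nearest-neighbour matching in feature space.

    ref_feats: (M, d) features of reference-image patches.
    gen_feats: (N, d) features of generated-image patches.
    Returns, for each reference patch, the index of the most similar
    generated patch and the cosine similarity of that pair.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    ref = ref_feats / (np.linalg.norm(ref_feats, axis=1, keepdims=True) + eps)
    gen = gen_feats / (np.linalg.norm(gen_feats, axis=1, keepdims=True) + eps)
    sim = ref @ gen.T                      # (M, N) similarity matrix
    best = sim.argmax(axis=1)              # best generated patch per reference patch
    score = sim[np.arange(len(ref)), best]
    return best, score
```

In practice, low-scoring pairs would probably be discarded before any features are transferred; where to set such a cutoff is not specified here and would have to be chosen empirically.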

Paper Structure

This paper contains 27 sections, 14 equations, 12 figures, 4 tables, 2 algorithms.

Figures (12)

  • Figure 1: Cross-image semantic correspondence in FLUX.1. For each specified pixel in the reference image, we identify the most similar pixel in the target image within FLUX.1's feature space. These corresponding pixel pairs are then visually connected by colored lines.
  • Figure 2: Subject-driven generation results of our FreeGraftor, achieved without any training or tuning process. These results demonstrate the superior properties of FreeGraftor, including (a) pixel-level detail preservation, (b) flexible text-guided control, and (c) support for multiple reference subjects.
  • Figure 3: Overview of FreeGraftor. First, we construct a collage based on the given text prompt and the reference image (Stage 1). Next, we invert this collage and record its diffusion trajectory (Stage 2). Finally, using the inverted noise as the initial latent representation, FreeGraftor synthesizes the output image through iterative denoising. During this process, the Semantic-aware Feature Grafting (SAFG) module integrates features from the collage to ensure alignment with the reference subject (Stage 3).
  • Figure 4: Illustration of the Semantic-aware Feature Grafting (SAFG) Module. For each patch in the reference image, it first establishes semantic correspondence in the generated image via feature matching. The module then (1) concatenates the key and value pairs of the reference patch with those of the corresponding generated patch, while (2) copying and applying the position embedding from the generated patch to the reference patch's key. This mechanism enables effective positional information sharing between corresponding patches while maintaining semantic alignment.
  • Figure 5: Generation results with single reference subject using different methods. Our FreeGraftor achieves pixel-level detail preservation (e.g., text and patterns) while allowing flexible text-guided control (e.g., poses of teddy bears and dogs).
  • ...and 7 more figures
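
The key/value grafting described in the Figure 4 caption can be sketched roughly as follows. This is a toy single-head attention in NumPy, not the authors' implementation: the functions `rope` and `grafted_attention`, the 1-D rotary position embedding, and the decision to let every generated query attend to every grafted key are all assumptions made for illustration. The real model is a multi-head MM-DiT with its own position encoding, and the paper's position constraint may restrict attention further.

```python
import numpy as np

def rope(x, pos):
    """Toy 1-D rotary position embedding (assumed here for illustration).

    x: (T, d) float array with even d; pos: (T,) positions.
    Rotates each consecutive feature pair by a position-dependent angle.
    """
    d = x.shape[-1]
    freqs = 1.0 / (10000.0 ** (np.arange(0, d, 2) / d))
    ang = pos[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def grafted_attention(q_gen, k_gen, v_gen, k_ref, v_ref, match):
    """Hypothetical sketch of attention with grafted reference keys/values.

    q_gen, k_gen, v_gen: (N, d) generated-image tokens, before position encoding.
    k_ref, v_ref: (M, d) reference tokens, before position encoding.
    match: (M,) index of the generated patch matched to each reference patch.
    """
    n, d = q_gen.shape
    gen_pos = np.arange(n, dtype=float)
    q = rope(q_gen, gen_pos)
    # (2) Each reference key reuses the position of its matched generated patch.
    k_ref_pos = rope(k_ref, gen_pos[match])
    # (1) Concatenate reference keys/values with the generated ones.
    k = np.concatenate([rope(k_gen, gen_pos), k_ref_pos], axis=0)
    v = np.concatenate([v_gen, v_ref], axis=0)
    logits = q @ k.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)   # numerically stable softmax
    w = np.exp(logits)
    w /= w.sum(axis=1, keepdims=True)
    return w @ v
```

Because a grafted key carries the same position as its matched generated patch, a query near that location sees the reference feature as if it sat at that spot in the generated image, which is the positional sharing the caption describes.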