FlowFixer: Towards Detail-Preserving Subject-Driven Generation

Jinyoung Jun, Won-Dong Jang, Wenbin Ouyang, Raghudeep Gadde, Jungbeom Lee

TL;DR

FlowFixer is a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation due to changes in a subject's scale and perspective. It performs direct image-to-image translation from visual references, avoiding the ambiguities of language prompts.

Abstract

We present FlowFixer, a refinement framework for subject-driven generation (SDG) that restores fine details lost during generation caused by changes in scale and perspective of a subject. FlowFixer proposes direct image-to-image translation from visual references, avoiding ambiguities in language prompts. To enable image-to-image training, we introduce a one-step denoising scheme to generate self-supervised training data, which automatically removes high-frequency details while preserving global structure, effectively simulating real-world SDG errors. We further propose a keypoint matching-based metric to properly assess fidelity in details beyond semantic similarities usually measured by CLIP or DINO. Experimental results demonstrate that FlowFixer outperforms state-of-the-art SDG methods in both qualitative and quantitative evaluations, setting a new benchmark for high-fidelity subject-driven generation.
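
As a rough illustration of what a one-step denoising degradation can look like, the following sketch assumes the common rectified-flow convention; the symbols $t$, $\boldsymbol{\epsilon}$, and $v_\theta$ are assumptions introduced here, and the paper's own equations may use a different parameterization. A clean image ${\mathbf{x}}_0$ is noised to level $t \in (0, 1]$, and the clean image is then estimated in a single step from the predicted velocity:

$$
{\mathbf{x}}_t = (1 - t)\,{\mathbf{x}}_0 + t\,\boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}),
$$

$$
\widehat{{\mathbf{x}}}_0 = {\mathbf{x}}_t - t\, v_\theta({\mathbf{x}}_t, t).
$$

With the exact velocity $\boldsymbol{\epsilon} - {\mathbf{x}}_0$, the second line recovers ${\mathbf{x}}_0$ exactly. A learned $v_\theta$ only approximates it, so a single-step estimate would tend to keep the coarse structure while losing fine detail, which matches the kind of degradation the abstract describes.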

Paper Structure (19 sections, 6 equations, 9 figures, 2 tables)

Figures (9)

  • Figure 1: Detail enhancement on FLUX.1-Kontext-Pro [labs2025flux], Qwen-Image-Edit [bai2023qwen], and Nano-Banana-Edit [comanici2025gemini] using our FlowFixer. The red boxes indicate the zoomed-in regions. Compared to the baseline subject-driven generations, FlowFixer restores the fine details from the reference, such as complex structures (top and bottom), small text (top and middle), and human identity (middle). It also handles challenging cases involving rotation (top), viewpoint changes (middle), and color shifts (bottom), while preserving the overall scene composition. FlowFixer is a baseline-agnostic, prompt-free model designed to enhance subject fidelity without altering the global layout.
  • Figure 2: FlowFixer overview. FlowFixer enhances SDG images by restoring fine subject details, using the original subject image as reference.
  • Figure 3: FlowFixer inference pipeline. The model takes two conditional inputs: reference subject image ${\mathbf{I}}_{\text{ref}}$ and the generated image ${\mathbf{I}}_{\text{gen}}$ from any SDG model. Then the model produces a refined result $\widehat{{\mathbf{I}}}_{\text{gen}}$ that preserves global layout. For faster inference, we optionally refine only a subject-centric crop of ${\mathbf{I}}_{\text{gen}}$ and blend it back using Poisson image blending.
  • Figure 4: Example of one-step denoising distortions. For each distortion level, pixel-wise variance maps are computed over 10 degraded samples. Insets show example outputs, with distortions concentrated in high-frequency regions.
  • Figure 5: Qualitative comparison of subject fidelity refinement on the FidelityBench-258K dataset. The insets in the full images show the reference subject images, and the red and green boxes indicate the zoomed-in regions. The zoomed-in regions are selected on the SDG baseline images, and the same area is cropped for all methods.
  • ...and 4 more figures
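
The pixel-wise variance map mentioned in the Figure 4 caption can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' code: `degrade` is a hypothetical stand-in for the one-step denoising degradation (it simply adds noise in proportion to local contrast), the image is a toy 4x4 grid, and the use of population variance is an assumption, since the caption does not say which estimator is used.

```python
import random
from statistics import pvariance


def local_contrast(image, r, c):
    """Sum of absolute differences to the right and lower neighbours (clamped at borders)."""
    h, w = len(image), len(image[0])
    right = image[r][min(c + 1, w - 1)]
    down = image[min(r + 1, h - 1)][c]
    return abs(image[r][c] - right) + abs(image[r][c] - down)


def degrade(image, level, rng):
    """Hypothetical degradation: Gaussian noise scaled by `level` and local contrast.

    Stands in for the one-step denoising scheme; flat regions are left unchanged.
    """
    h, w = len(image), len(image[0])
    return [
        [image[r][c] + rng.gauss(0.0, level * local_contrast(image, r, c)) for c in range(w)]
        for r in range(h)
    ]


def variance_map(samples):
    """Per-pixel population variance across a list of equally sized images."""
    h, w = len(samples[0]), len(samples[0][0])
    return [
        [pvariance([s[r][c] for s in samples]) for c in range(w)]
        for r in range(h)
    ]


# Toy image: dark left half, bright right half, so the only edge is at column 1.
image = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
rng = random.Random(0)
samples = [degrade(image, 0.5, rng) for _ in range(10)]  # 10 degraded samples, as in Figure 4
vmap = variance_map(samples)
```

In this toy setup the variance is zero in the flat regions and positive only along the edge column, which mirrors the caption's observation that distortions concentrate in high-frequency regions.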