Borrowing from anything: A generalizable framework for reference-guided instance editing
Shengxiao Zhou, Chenghua Li, Jianhao Huang, Qinghao Hu, Yifan Zhang
TL;DR
Reference-guided editing is hampered by semantic entanglement between a reference's intrinsic appearance and its extrinsic attributes. GENIE introduces a Spatial Alignment Module (SAM) for pose and scale normalization, an Adaptive Residual Scaling Module (ARSM) for appearance purification, and a Progressive Attention Fusion (PAF) mechanism for controlled fusion, all within a dual U-Net latent diffusion framework trained with a standard diffusion loss. Ablation studies and extensive results on AnyInsertion show state-of-the-art fidelity, robustness, and disentanglement across Object, Garment, and Person editing. The approach substantially improves texture fidelity, structural consistency, and semantic alignment, advancing practical deployment of reference-guided edits.
Abstract
Reference-guided instance editing is fundamentally limited by semantic entanglement, where a reference's intrinsic appearance is intertwined with its extrinsic attributes. The key challenge is deciding what information to borrow from the reference and how to apply it appropriately to the target. To tackle this challenge, we propose GENIE, a Generalizable Instance Editing framework that achieves explicit disentanglement. GENIE first corrects spatial misalignments with a Spatial Alignment Module (SAM). Then, an Adaptive Residual Scaling Module (ARSM) learns what to borrow by amplifying salient intrinsic cues and suppressing extrinsic attributes, while a Progressive Attention Fusion (PAF) mechanism learns how to render this appearance onto the target while preserving the target's structure. Extensive experiments on the challenging AnyInsertion dataset demonstrate that GENIE achieves state-of-the-art fidelity and robustness, setting a new standard for disentanglement-based instance editing.
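The training objective is described only as a standard diffusion loss. As a minimal sketch, assuming this refers to the common noise-prediction objective (mean squared error between the sampled noise and the predicted noise), the toy example below shows the forward noising step and the loss on a flat list of numbers. The function names `noise_sample` and `diffusion_loss` and the toy values are hypothetical and used only for illustration. The actual GENIE objective operates on conditioned latents inside the dual U-Net and may differ in its details.

```python
import math
import random


def noise_sample(x0, alpha_bar_t, eps):
    """Forward noising: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bar_t)
    noise_scale = math.sqrt(1.0 - alpha_bar_t)
    return [signal_scale * x + noise_scale * e for x, e in zip(x0, eps)]


def diffusion_loss(eps, eps_pred):
    """Mean squared error between the sampled noise and the predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


# Toy "latent" and Gaussian noise (assumed values, not taken from the paper).
rng = random.Random(0)
x0 = [0.5, -1.0, 0.25, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]

# Noised latent that a denoising network would receive at this timestep.
x_t = noise_sample(x0, alpha_bar_t=0.7, eps=eps)

# A perfect noise prediction gives zero loss; predicting all zeros does not.
perfect_loss = diffusion_loss(eps, eps)
zero_guess_loss = diffusion_loss(eps, [0.0] * len(eps))
```

In a real training loop the second argument to `diffusion_loss` would be the network's noise prediction for `x_t`, the timestep, and the conditioning inputs, rather than a hand-written list.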
