Does FLUX Already Know How to Perform Physically Plausible Image Composition?
Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong
TL;DR
This paper tackles the challenge of inserting a user-specified object into a new scene while handling complex lighting and diverse, high-resolution inputs. It introduces SHINE, a training-free framework that combines Non-Inversion Latent Preparation with Manifold-Steered Anchor (MSA) loss, Degradation-Suppression Guidance (DSG), and Adaptive Background Blending (ABB). It also introduces the ComplexCompo benchmark, which evaluates composition across diverse resolutions and lighting conditions. Results on ComplexCompo and DreamEditBench show state-of-the-art performance on both standard and human-aligned metrics, and ablations validate the contribution of each component. The work demonstrates that modern diffusion priors can be harnessed without inversion or retraining, enabling robust, physics-aware composition in challenging scenes.
Abstract
Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet there is no framework to unleash them without resorting either to latent inversion, which often locks object poses into contextually inappropriate orientations, or to brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces a manifold-steered anchor loss that leverages pretrained customization adapters (e.g., IP-Adapter) to guide latents toward faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.
