Does FLUX Already Know How to Perform Physically Plausible Image Composition?

Shilin Lu, Zhuming Lian, Zihan Zhou, Shaocong Zhang, Chen Zhao, Adams Wai-Kin Kong

TL;DR

This paper tackles the challenge of inserting a user-specified object into a new scene with realistic lighting, across diverse and high-resolution inputs. It introduces SHINE, a training-free framework that combines Non-Inversion Latent Preparation with Manifold-Steered Anchor (MSA) loss, Degradation-Suppression Guidance (DSG), and Adaptive Background Blending (ABB), along with the ComplexCompo benchmark for evaluation across diverse resolutions and lighting conditions. Empirical results on ComplexCompo and DreamEditBench show state-of-the-art performance on both objective and human-aligned metrics, with ablations validating the distinct contribution of each component. The work demonstrates that modern diffusion priors can be harnessed without inversion or retraining, enabling robust, physics-aware composition in challenging scenes.

Abstract

Image composition aims to seamlessly insert a user-specified object into a new scene, but existing models struggle with complex lighting (e.g., accurate shadows, water reflections) and diverse, high-resolution inputs. Modern text-to-image diffusion models (e.g., SD3.5, FLUX) already encode essential physical and resolution priors, yet lack a framework to unleash them without resorting to latent inversion, which often locks object poses into contextually inappropriate orientations, or brittle attention surgery. We propose SHINE, a training-free framework for Seamless, High-fidelity Insertion with Neutralized Errors. SHINE introduces manifold-steered anchor loss, leveraging pretrained customization adapters (e.g., IP-Adapter) to guide latents for faithful subject representation while preserving background integrity. Degradation-suppression guidance and adaptive background blending are proposed to further eliminate low-quality outputs and visible seams. To address the lack of rigorous benchmarks, we introduce ComplexCompo, featuring diverse resolutions and challenging conditions such as low lighting, strong illumination, intricate shadows, and reflective surfaces. Experiments on ComplexCompo and DreamEditBench show state-of-the-art performance on standard metrics (e.g., DINOv2) and human-aligned scores (e.g., DreamSim, ImageReward, VisionReward). Code and benchmark will be publicly available upon publication.

Paper Structure

This paper contains 30 sections, 12 equations, 18 figures, 5 tables, 1 algorithm.

Figures (18)

  • Figure 1: Showcase of our training-free image composition method, SHINE. This gallery highlights SHINE's ability to seamlessly integrate subjects into complex scenes, including low-light conditions, intricate shadows, and water reflections.
  • Figure 2: Image composition from advanced multimodal models under three challenging conditions: backlighting, shadows, and water surfaces. Refer to the appendix for prompt details.
  • Figure 3: Overview of the proposed framework. (a) The noisy latent is created by inpainting the background with a VLM-derived object description, then adding Gaussian noise. (b) Manifold-Steered Anchor (MSA) loss guides noisy latents toward faithfully capturing the reference subject (red arrow), while preserving the structural integrity of the background. Concretely, it enforces that the prediction of the optimized latent $\boldsymbol{z}_{t}^*$ on the adapter-augmented model’s manifold remains close to the prediction of the original latent $\boldsymbol{z}_{t}$ on the base model’s manifold. (c) Degradation-Suppression Guidance (DSG) constructs a negative velocity pointing toward low-quality regions by blurring $\boldsymbol{Q}_\text{img}$ and, in a CFG-like manner, steers the trajectory away from this low-quality distribution.
  • Figure 4: Left: Robustness of FLUX. Right: Impacts of blurring different features in FLUX.
  • Figure 5: Comparison of rectangular-mask blending and Adaptive Background Blending (ABB). Boundary regions (pink dashed boxes) are enlarged for clarity. Zoom in for details.
  • ...and 13 more figures
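
The two mechanisms described in the Figure 3 caption can be illustrated with a small numerical sketch. This is not the authors' implementation: the function names `msa_anchor_loss` and `suppress_degradation`, the mean-squared distance, and the exact extrapolation formula are assumptions made for illustration. The caption states only that the two predictions should remain close and that the guidance is applied "in a CFG-like manner". The arrays stand in for model predictions, which in practice would come from the base model, the adapter-augmented model, and a forward pass with blurred image queries.

```python
import numpy as np


def msa_anchor_loss(pred_adapter_opt, pred_base_orig):
    """Hypothetical MSA-style objective (mean squared distance is an assumption).

    pred_adapter_opt: prediction of the adapter-augmented model on the optimized latent z_t*.
    pred_base_orig:   prediction of the base model on the original latent z_t (the anchor).
    """
    diff = np.asarray(pred_adapter_opt, dtype=float) - np.asarray(pred_base_orig, dtype=float)
    return float(np.mean(diff ** 2))


def suppress_degradation(v, v_degraded, scale):
    """Hypothetical CFG-like extrapolation away from a degraded (negative) velocity.

    v:          velocity from the normal forward pass.
    v_degraded: velocity from a pass with blurred image queries (the low-quality direction).
    scale:      guidance strength; 0 leaves v unchanged.
    """
    v = np.asarray(v, dtype=float)
    v_degraded = np.asarray(v_degraded, dtype=float)
    return v + scale * (v - v_degraded)


# Toy values standing in for model outputs.
v = np.array([1.0, 2.0])
v_degraded = np.array([0.5, 2.5])
guided = suppress_degradation(v, v_degraded, scale=1.0)
loss = msa_anchor_loss(v, v_degraded)
```

With `scale=0` the guided velocity equals the unguided one, which mirrors how classifier-free guidance reduces to the conditional prediction when the guidance term is switched off.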