Towards Source-Aware Object Swapping with Initial Noise Perturbation

Jiahui Zhan, Xianbing Sun, Xiangnan Zhu, Yikun Ji, Ruitong Liu, Liqing Zhang, Jianfu Zhang

TL;DR

The key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images.

Abstract

Object swapping aims to replace a source object in a scene with a reference object while preserving object fidelity, scene fidelity, and object-scene harmony. Existing methods either require per-object finetuning and slow inference or rely on extra paired data that mostly depict the same object across contexts, forcing models to rely on background cues rather than learning cross-object alignment. We propose SourceSwap, a self-supervised and source-aware framework that learns cross-object alignment. Our key insight is to synthesize high-quality pseudo pairs from any image via a frequency-separated perturbation in the initial-noise space, which alters appearance while preserving pose, coarse shape, and scene layout, requiring no videos, multi-view data, or additional images. We then train a dual U-Net with full-source conditioning and a noise-free reference encoder, enabling direct inter-object alignment, zero-shot inference without per-object finetuning, and lightweight iterative refinement. We further introduce SourceBench, a high-quality benchmark with higher resolution, more categories, and richer interactions. Experiments demonstrate that SourceSwap achieves superior fidelity, stronger scene preservation, and more natural harmony, and it transfers well to edits such as subject-driven refinement and face swapping.

Paper Structure (26 sections, 5 equations, 21 figures, 6 tables)

This paper contains 26 sections, 5 equations, 21 figures, 6 tables.

Figures (21)

  • Figure 1: We propose SourceSwap, a source-aware object swapping framework that learns inter-object alignment from single images, achieving superior object fidelity, scene preservation, and object–scene harmony over prior methods. It is versatile in use, supporting multi-object swapping, face swapping, and subject-driven refinement.
  • Figure 2: Motivation of initial-noise perturbation. Existing inpainting-based methods rely on paired data from multi-view capture, videos, retrieval, or cropping, which are costly, blurry, or cannot model meaningful appearance changes of distinct objects. Our approach generates high-quality pseudo pairs from single images, producing clear yet coherent object variations while keeping the background intact.
  • Figure 3: Overview of SourceSwap. The pipeline consists of two phases: (1) Initial-noise perturbation, which converts any source image $I_s$ into a pseudo pair by transforming its initial latent $z_T$ to the frequency domain and locally permuting high-frequency components within the object mask to alter appearance while preserving coarse structure and scene context; and (2) Source-aware training, which uses these pseudo pairs to train a dual U-Net with full-source conditioning and a clean-reference encoder, enabling efficient learning of inter-object alignment and zero-shot inference. A lightweight iterative refinement step further improves object fidelity.
  • Figure 4: Left: decoded DDIM latents at different timesteps and their low- and high-frequency components. The high-frequency component retains most of the identity and structural cues, while the low-frequency component is coarse. Right: ablations during pseudo pair construction. Permuting all components severely distorts the object; permuting only the low-frequency component yields small changes; fixing the low-frequency component and permuting only the high-frequency component produces coherent and meaningful appearance edits.
  • Figure 5: Pseudo pairs generated by initial-noise perturbation. Only high-frequency latents inside the mask are permuted. The images show appearance variation with preserved structure and scene harmony.
  • ...and 16 more figures
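
The initial-noise perturbation described in the captions of Figures 3 to 5 can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a radial low-pass mask in the FFT domain as the frequency split, and interprets "locally permuting" as shuffling high-frequency values among masked pixels inside small non-overlapping windows. The names `split_frequencies` and `perturb_initial_noise` and the parameters `cutoff`, `window`, and `seed` are hypothetical, and a random array stands in for the inverted latent $z_T$.

```python
import numpy as np


def split_frequencies(z, cutoff):
    """Split a (C, H, W) latent into low- and high-frequency parts.

    Assumption: a radial low-pass mask in normalized frequency units
    (cycles per pixel); the source does not specify the exact filter.
    """
    _, h, w = z.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_pass = np.sqrt(fy ** 2 + fx ** 2) <= cutoff  # (H, W) boolean mask
    spectrum = np.fft.fft2(z, axes=(-2, -1))
    z_low = np.fft.ifft2(spectrum * low_pass, axes=(-2, -1)).real
    z_high = z - z_low  # residual, so z_low + z_high reconstructs z
    return z_low, z_high


def perturb_initial_noise(z_T, object_mask, cutoff=0.1, window=4, seed=0):
    """Shuffle high-frequency values inside the object mask, window by window.

    The low-frequency part and everything outside the mask are left untouched.
    """
    rng = np.random.default_rng(seed)
    z_low, z_high = split_frequencies(z_T, cutoff)
    z_high_perm = z_high.copy()
    _, h, w = z_T.shape
    for top in range(0, h, window):
        for left in range(0, w, window):
            block = object_mask[top:top + window, left:left + window]
            ys, xs = np.nonzero(block)
            if ys.size < 2:
                continue  # nothing to permute in this window
            ys, xs = ys + top, xs + left
            order = rng.permutation(ys.size)
            # Same permutation for every channel: whole latent vectors move.
            z_high_perm[:, ys, xs] = z_high[:, ys[order], xs[order]]
    return z_low + z_high_perm


# Stand-in data: a random "latent" and a square object mask.
z_T = np.random.default_rng(1).standard_normal((4, 16, 16))
object_mask = np.zeros((16, 16), dtype=bool)
object_mask[4:12, 4:12] = True
z_T_perturbed = perturb_initial_noise(z_T, object_mask)
```

In this sketch the perturbed latent equals the original everywhere outside the mask, which mirrors the claim that the background stays intact. Shuffling in the spatial domain can leak a small amount of energy back into low frequencies, so the sketch preserves coarse structure only approximately.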