DiffUHaul: A Training-Free Method for Object Dragging in Images

Omri Avrahami, Rinon Gal, Gal Chechik, Ohad Fried, Dani Lischinski, Arash Vahdat, Weili Nie

TL;DR

DiffUHaul tackles the challenge of training-free object dragging in images by leveraging a localized diffusion model (BlobGEN) and addressing entanglement through inference-time gated self-attention masking. It introduces a diffusion-anchoring mechanism that gradually fuses layout changes with preserved object appearance, aided by self-attention sharing and a soft anchoring strategy. The method extends to real images via DDPM self-attention bucketing and dedicated blob extraction plus background blending, and is evaluated with automatic metrics and user studies, showing robust performance against state-of-the-art baselines. The work advances practical object manipulation in complex scenes without per-image training, with potential impacts for creative tools and visual content editing, while acknowledging limitations in rotation, resizing, and object collisions.

Abstract

Text-to-image diffusion models have proven effective for solving many image editing tasks. However, the seemingly straightforward task of seamlessly relocating objects within a scene remains surprisingly challenging. Existing methods addressing this problem often struggle to function reliably in real-world scenarios due to a lack of spatial reasoning. In this work, we propose a training-free method, dubbed DiffUHaul, that harnesses the spatial understanding of a localized text-to-image model for the object dragging task. Blindly manipulating the layout inputs of the localized model tends to cause low editing performance due to the intrinsic entanglement of object representations in the model. To this end, we first apply attention masking in each denoising step to make the generation more disentangled across different objects, and adopt a self-attention sharing mechanism to preserve the high-level object appearance. Furthermore, we propose a new diffusion anchoring technique: in the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance; in the later denoising steps, we pass the localized features from the source images to the interpolated images to retain fine-grained object details. To adapt DiffUHaul to real-image editing, we apply DDPM self-attention bucketing, which can better reconstruct real images with the localized model. Finally, we introduce an automated evaluation pipeline for this task and showcase the efficacy of our method. Our results are reinforced through a user preference study.

Paper Structure (27 sections, 2 equations, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Object Dragging Robustness. When a puppy in a complex environment (notably with its reflection in the water and ripples nearby) is dragged to different locations from left to right, the previous method DiffEditor (Mou et al., 2024) leaves editing traces at the original location, while our method behaves more robustly.
  • Figure 2: BlobGEN Architecture. BlobGEN incorporates the additional blob information into the Stable Diffusion model by adding two new layers in each attention block: masked cross-attention and gated self-attention.
  • Figure 3: Method Overview. Given an input image $I$, we start by extracting the blob parameters $P_s$ of its layout; then, by changing the layout according to the user-provided target location, we obtain the new blob parameters $P_d$. By conditioning the localized text-to-image model on the respective blob representations, we iteratively denoise the source and target images ($z_s$ and $z_d$) while incorporating gated self-attention masking (\ref{sec:blobgen_entanglement}) and soft attention anchoring (\ref{sec:object_moving_generated}) in each self-attention block, until we obtain the desired editing result $I'$.
  • Figure 4: Gated Self-Attention Leakage. Given scene descriptions of two blobs, "a photo of a rabbit" and "a photo of a cat", the standard BlobGEN model (first row, first column) generates two rabbits instead of a cat and a rabbit. We then visualize the gated self-attention layers, as explained in \ref{sec:blobgen_entanglement}. The standard BlobGEN model (first row) leaks the rabbit information into the cat blob (first row, third column), while our masked version of the gated self-attention (second row) disentangles the blobs (second row, third column). In addition, the gated self-attention (second column) behaves de facto as a cross-attention layer, as the vast majority of the attention is between the text tokens $T$ and the visual tokens $V$.
  • Figure 5: Self-Attention Soft Anchoring. Given the source blob $B_s$ and target blob $B_d$, we start by extracting the self-attention outputs $O_s$ and $O_d$, respectively. Then, during the first $\rho$ iterations, we blend these maps according to the timestep ratio $f = \frac{t}{T}$, where $t$ is the current timestep and $T$ is the total number of timesteps. After the anchor map $O_a$ is computed, we use it to determine the position of the new blob, while taking the appearance from the corresponding $O_s$ map using nearest-neighbor copying.
  • ...and 8 more figures
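
The soft anchoring described in the Figure 5 caption can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (`blend_anchor`, `nearest_neighbor_copy`), the flat `(tokens, dim)` layout, the boolean blob masks, and the use of squared L2 distance for the nearest-neighbor search are all assumptions made for illustration. The caption also does not say which map receives the weight $f$; the sketch assumes the source map $O_s$ does.

```python
import numpy as np

def blend_anchor(o_s, o_d, t, T):
    """Blend source/target self-attention outputs with the ratio f = t / T."""
    f = t / T
    # Assumption: weight f goes to the source map O_s and (1 - f) to the
    # target map O_d. The caption does not fix this direction.
    return f * o_s + (1.0 - f) * o_d

def nearest_neighbor_copy(o_a, o_s, target_mask, source_mask):
    """Replace each anchor token inside the target blob with its closest
    source-blob token from O_s (squared L2 distance, an assumed metric)."""
    out = o_a.copy()
    src = o_s[source_mask]                  # (n_src, dim) source-blob tokens
    tgt_idx = np.flatnonzero(target_mask)   # indices of target-blob tokens
    # Pairwise squared distances between anchor tokens and source tokens.
    diff = o_a[tgt_idx, None, :] - src[None, :, :]
    dist = (diff ** 2).sum(axis=-1)         # (n_tgt, n_src)
    out[tgt_idx] = src[dist.argmin(axis=1)]
    return out

# Toy example with 3 tokens of dimension 2 (values are made up).
o_s = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]])
o_d = np.zeros_like(o_s)
o_a = blend_anchor(o_s, o_d, t=5, T=10)     # halfway blend
source_mask = np.array([True, True, False])
target_mask = np.array([True, False, True])
anchored = nearest_neighbor_copy(o_a, o_s, target_mask, source_mask)
```

In this sketch the anchor map only decides where the blob ends up (through `target_mask` and the token matching), while the copied values come from `o_s`. This mirrors the caption's split between position and appearance.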