A Diffusion-Based Framework for Occluded Object Movement

Zheng-Peng Duan, Jiawei Zhang, Siyu Liu, Zheng Lin, Chun-Le Guo, Dongqing Zou, Jimmy Ren, Chongyi Li

TL;DR

This work introduces DiffOOM, a diffusion-based framework for moving occluded objects. It decouples the task into de-occlusion and movement, which run as parallel branches on Stable Diffusion V1.5. The de-occlusion branch uses color-filled inputs, Latent Hold, and LoRA-guided diffusion with a refined cross-attention map to reconstruct occluded regions, while the movement branch uses latent optimization and local text guidance to place the object at a target location and harmonize it with the scene. Key innovations include a latent-space framework with Latent Hold, a color-fill strategy to constrain generation, Latent Resizing to avoid degradation, a refined cross-attention map to capture object shape priors, and region-restricted text guidance to improve integration. The authors evaluate their approach on a COCOA-derived dataset, report that it outperforms multiple baselines on both the de-occlusion and movement tasks, and corroborate these findings with a user study, pointing to practical use in real-world image editing and possible integration with existing editing tools.

Abstract

Seamlessly moving objects within a scene is a common requirement in image editing, but it remains a challenge for existing editing methods. Occlusion, which is common in real-world images, makes the task harder still, because the occluded portion must be completed before the object can be moved. To leverage the real-world knowledge embedded in pre-trained diffusion models, we propose a Diffusion-based framework specifically designed for Occluded Object Movement, named DiffOOM. DiffOOM consists of two parallel branches that perform object de-occlusion and movement simultaneously. The de-occlusion branch uses a background color-fill strategy and a continuously updated object mask to focus the diffusion process on completing the obscured portion of the target object. Concurrently, the movement branch employs latent optimization to place the completed object at the target location and adopts local text-conditioned guidance to integrate the object into its new surroundings appropriately. Extensive evaluations demonstrate the superior performance of our method, which is further validated by a comprehensive user study.
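To make the background color-fill idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that every pixel outside the visible object mask is replaced with a single uniform color before the image is passed to the de-occlusion branch. The function name `color_fill`, the default white fill color, and the NumPy array layout are all illustrative assumptions that do not come from the paper.

```python
import numpy as np


def color_fill(image, visible_mask, fill_color=(255, 255, 255)):
    """Replace every pixel outside the visible object mask with one color.

    image:        (H, W, 3) array, e.g. uint8 RGB.
    visible_mask: (H, W) array, nonzero where the object is visible.
    fill_color:   hypothetical uniform background color (an assumption).
    """
    # Add a channel axis so the mask broadcasts across the RGB channels.
    keep = visible_mask.astype(bool)[..., None]
    # Build a constant-color image with the same shape and dtype as the input.
    fill = np.broadcast_to(np.asarray(fill_color, dtype=image.dtype), image.shape)
    # Keep object pixels, overwrite everything else with the fill color.
    return np.where(keep, image, fill)
```

Compared with filling the same region with noise, a uniform fill removes background content that the diffusion model might otherwise try to continue, which is the constraint on generation that the summary attributes to the color-fill strategy.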

Paper Structure

This paper contains 26 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparison with other methods for occluded object movement. Given a real-world image, our method can seamlessly move the occluded object to a user-specified position while completing the occluded portion.
  • Figure 2: Overview of proposed framework (a) and LoRA tuning process (b). (a) We decouple the task of occluded object movement into de-occlusion and movement, handled by parallel branches. Both branches are built upon Stable Diffusion V1.5 and operate simultaneously. The de-occlusion branch leverages the prior knowledge within the diffusion models to complete the occluded portion, while the movement branch mainly places the completed object at the target position. (b) To ensure the content generated by the de-occlusion branch aligns with the characteristics of the target object, we equip this branch with LoRA, which is fine-tuned using a masked diffusion loss that applies exclusively to the visible portions of the object.
  • Figure 3: (a)-(b) show the process of obtaining $\bar{\mathbf{I}}_s$ as in Eq. (\ref{equ:crop}). (b) marks $1-\bar{\mathbf{M}}_v$ with a white mask. (c)-(f) are results from variants of the De-occlusion Branch. (c) is generated by filling $1-\bar{\mathbf{M}}_v$ with noise as in Eq. (\ref{equ:noise_fill}). (d) introduces the color-fill strategy as in Eq. (\ref{equ:color_fill}). (e) is generated under the guidance of progressively updated masks. (f) is the full De-occlusion Branch. (g) shows the progressively updated masks based on the refined cross-attention map $\bar{\mathbf{R}}_{t}^{C}$.
  • Figure 4: (a) and (d) are source images, and the others are results from variants of the Movement Branch. The starting and ending points of the yellow arrows represent the original and target positions of the moved object. (b) is the result with direct resizing as in Eq. (\ref{equ:lorg}). (c) introduces the latent resizing operation as in Eq. (\ref{equ:lresize}), alleviating the severe degradation. (e) and (f) are results without and with local text guidance, which helps the object integrate into its surroundings more appropriately.
  • Figure 5: Qualitative comparison on de-occlusion. PCNet struggles with the completion of large-scale occlusion and complex objects, while our method can generate high-quality content consistent with the target object.
  • ...and 2 more figures
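
The Figure 2 caption describes a masked diffusion loss that applies only to the visible portions of the object. A minimal sketch of such a loss follows, assuming the usual noise-prediction objective with a squared error that is averaged over masked positions only. The function name `masked_diffusion_loss`, the `(C, H, W)` layout, and the normalization are illustrative assumptions, and the paper's exact formulation may differ.

```python
import numpy as np


def masked_diffusion_loss(noise_pred, noise_target, visible_mask, eps=1e-8):
    """Mean squared error between predicted and target noise on visible pixels.

    noise_pred, noise_target: (C, H, W) float arrays in latent space.
    visible_mask:             (H, W) array, 1 where the object is visible.
    eps:                      small constant guarding against an empty mask.
    """
    # Broadcast the spatial mask over the channel axis.
    mask = visible_mask.astype(noise_pred.dtype)[None, :, :]
    # Squared error, zeroed wherever the object is occluded or absent.
    squared_error = (noise_pred - noise_target) ** 2 * mask
    # Normalize by the number of contributing elements, not the full tensor.
    count = mask.sum() * noise_pred.shape[0]
    return squared_error.sum() / (count + eps)
```

Because occluded and background positions contribute nothing to the sum, fine-tuning with this kind of loss would fit the LoRA weights to the object's visible appearance without penalizing the model for whatever it generates in the occluded region.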