A Diffusion-Based Framework for Occluded Object Movement
Zheng-Peng Duan, Jiawei Zhang, Siyu Liu, Zheng Lin, Chun-Le Guo, Dongqing Zou, Jimmy Ren, Chongyi Li
TL;DR
This work introduces DiffOOM, a diffusion-based framework for moving occluded objects that decouples the task into de-occlusion and movement, executed in parallel branches on Stable Diffusion V1.5. The de-occlusion branch uses color-filled inputs, Latent Hold, and LoRA-guided diffusion with a refined cross-attention map to reconstruct occluded regions, while the movement branch uses latent optimization and local text guidance to place the object at a target location and harmonize it with the scene. Key innovations include a latent-space framework with Latent Hold, a color-fill strategy to constrain generation, Latent Resizing to avoid degradation, a refined cross-attention map to capture object shape priors, and region-restricted text guidance to improve integration. The authors validate their approach on a COCOA-derived dataset, outperform multiple baselines in both de-occlusion and movement tasks, and corroborate the findings with a user study, demonstrating practical utility for real-world image editing and potential integration with existing editing tools.
Abstract
Seamlessly moving objects within a scene is a common requirement in image editing, yet it remains a challenge for existing editing methods. Real-world images make the task harder still, because the target object is often partially occluded and the occluded portion must be completed before the object can be moved. To leverage the real-world knowledge embedded in pre-trained diffusion models, we propose DiffOOM, a Diffusion-based framework specifically designed for Occluded Object Movement. DiffOOM consists of two parallel branches that perform object de-occlusion and movement simultaneously. The de-occlusion branch uses a background color-fill strategy and a continuously updated object mask to focus the diffusion process on completing the obscured portion of the target object. Concurrently, the movement branch employs latent optimization to place the completed object at the target location and adopts local text-conditioned guidance to integrate the object appropriately into its new surroundings. Extensive evaluations demonstrate the superior performance of our method, which is further validated by a comprehensive user study.
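To make two of the ingredients named above more concrete, the following is a minimal pixel-space sketch of a background color fill and of translating an object mask to a target location. It is an illustration under stated assumptions, not the DiffOOM implementation: the function names `color_fill` and `shift_mask`, the gray fill value, and the nested-list image representation are all hypothetical, and the actual method operates on Stable Diffusion latents with a mask that is updated during denoising.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]
Mask = List[List[int]]  # 1 = visible object pixel, 0 = everything else


def color_fill(image: Image, mask: Mask, fill: Pixel = (127, 127, 127)) -> Image:
    """Keep object pixels and replace everything else with one flat color.

    The fill color is an assumption chosen for illustration only.
    """
    return [
        [px if m else fill for px, m in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


def shift_mask(mask: Mask, dx: int, dy: int) -> Mask:
    """Translate a mask by (dx, dy); pixels moved off-canvas are dropped."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if mask[y][x] and 0 <= ny < h and 0 <= nx < w:
                out[ny][nx] = 1
    return out
```

In this sketch, the color-filled image stands in for the input that lets a generative model concentrate on the object alone, and the shifted mask marks where the completed object should end up; how those signals are consumed by the diffusion process is outside the scope of the example.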
