Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing
Pengfei Jiang, Mingbao Lin, Fei Chao
TL;DR
Move&Act is a tuning-free, two-branch diffusion-based framework that lets users simultaneously edit an object's action and control where the edited object is generated, while preserving the background. During inversion, it transfers the object information to the target area, inpaints the source area to remove remnants of the object, and applies a background-preservation loss. During editing, self-attention features query the keys and values from the corresponding inversion timestep, with this K/V exchange strategy starting from step S=7, to keep the result consistent with the source image. The approach is reported to yield superior prompt alignment and background fidelity, supported by qualitative results and quantitative metrics (CLIP score and AP_{50}) on a dedicated dataset; code is available at https://github.com/mobiushy/move-act. By avoiding a reconstruction branch and using cross-attention-based localization, Move&Act delivers precise object relocation and action editing while maintaining the integrity of the surrounding scene. The authors acknowledge limitations on complex backgrounds and sensitivity to the dilation kernel size, and future work aims to improve source-area inpainting and background recovery.
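The K/V exchange described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the single-head token layout of shape (tokens, dim), and the convention that steps count up from 0 with the exchange active once `step >= start_step` are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Scaled dot-product attention for one head; q, k, v are (tokens, dim).
    scale = 1.0 / np.sqrt(q.shape[-1])
    weights = softmax((q @ k.T) * scale, axis=-1)
    return weights @ v


def edit_self_attention(q_edit, k_edit, v_edit, k_inv, v_inv, step, start_step=7):
    # Hypothetical helper: from `start_step` onward, the editing branch's
    # queries attend to keys/values cached from the inversion branch at the
    # same timestep; before that, the editing branch uses its own keys/values.
    if step >= start_step:
        return attention(q_edit, k_inv, v_inv)
    return attention(q_edit, k_edit, v_edit)


rng = np.random.default_rng(0)
q_e, k_e, v_e = (rng.standard_normal((16, 8)) for _ in range(3))
k_i, v_i = (rng.standard_normal((16, 8)) for _ in range(2))

early = edit_self_attention(q_e, k_e, v_e, k_i, v_i, step=3)
late = edit_self_attention(q_e, k_e, v_e, k_i, v_i, step=10)
```

In this sketch only the source of the keys and values changes; the queries always come from the editing branch, which is what lets the edited layout draw its appearance from the source image.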
Abstract
Current methods commonly use a three-branch structure of inversion, reconstruction, and editing to tackle the consistent image editing task. However, these methods lack control over the generation position of the edited object and have issues with background preservation. To overcome these limitations, we propose a tuning-free method with only two branches: inversion and editing. This approach allows users to simultaneously edit the object's action and control the generation position of the edited object. Additionally, it achieves improved background preservation. Specifically, at a specific time step during the inversion process, we transfer the edited object information to the target area and repair or preserve the background in the other areas. In the editing stage, we use the image features in self-attention to query the keys and values of the corresponding time step in the inversion, achieving consistent image editing. Impressive image editing results and quantitative evaluation demonstrate the effectiveness of our method. The code is available at https://github.com/mobiushy/move-act.
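To make the object-transfer step more concrete, the following NumPy sketch moves the latent values under an object mask to a shifted target area and reports which source pixels would then need repair. It is only an illustration under stated assumptions: the latent layout (channels, height, width), the pure-translation shift, and the name `transfer_object` are hypothetical, and the inpainting of the vacated area and the background-preservation loss are not implemented here.

```python
import numpy as np


def transfer_object(latent, src_mask, shift):
    # latent: (C, H, W) array; src_mask: (H, W) boolean object mask;
    # shift: (dy, dx) translation in latent pixels.
    # Assumption: the shift keeps the object inside the frame, because
    # np.roll wraps around at the borders.
    dy, dx = shift
    tgt_mask = np.roll(src_mask, shift=(dy, dx), axis=(0, 1))
    moved = np.roll(latent, shift=(dy, dx), axis=(1, 2))

    out = latent.copy()
    # Paste the object's latent values into the target area.
    out[:, tgt_mask] = moved[:, tgt_mask]

    # Source pixels no longer covered by the object; these would be
    # repaired by inpainting in a full pipeline.
    hole_mask = src_mask & ~tgt_mask
    return out, tgt_mask, hole_mask


latent = np.arange(2 * 4 * 4, dtype=float).reshape(2, 4, 4)
src_mask = np.zeros((4, 4), dtype=bool)
src_mask[0, 0] = True

out, tgt_mask, hole_mask = transfer_object(latent, src_mask, shift=(1, 2))
```

Everything outside the target mask is left untouched in this sketch, which mirrors the idea of preserving the background in the other areas.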
