Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing

Pengfei Jiang, Mingbao Lin, Fei Chao

TL;DR

Move&Act is a tuning-free, two-branch diffusion framework that lets users control both the action of an edited object and the position where it is generated, while preserving the background. During inversion, it transfers object information to the target area, inpaints the source area to remove remnants of the object, and applies a background-preservation loss. During editing, self-attention image features query the keys and values from the corresponding inversion timestep, with this K/V exchange starting from step S=7. The method reports better prompt alignment and background fidelity, supported by qualitative results and quantitative metrics (CLIP score and AP_{50}) on a dedicated dataset; code is available at https://github.com/mobiushy/move-act. By dropping the reconstruction branch and localizing the object through cross-attention, Move&Act relocates objects and edits their actions while keeping the surrounding scene intact. The authors acknowledge limitations on complex backgrounds and sensitivity to the dilation kernel size, and point to improved source-area inpainting and background recovery as future work.

Abstract

Current methods commonly use a three-branch structure of inversion, reconstruction, and editing to tackle the consistent image editing task. However, these methods lack control over the generation position of the edited object and have issues with background preservation. To overcome these limitations, we propose a tuning-free method with only two branches: inversion and editing. This approach allows users to simultaneously edit the object's action and control the generation position of the edited object. Additionally, it achieves improved background preservation. Specifically, at a specific time step during the inversion process, we transfer the edited object information to the target area and repair or preserve the background of the other areas. In the editing stage, we use the image features in self-attention to query the key and value of the corresponding time step in the inversion to achieve consistent image editing. Impressive image editing results and quantitative evaluation demonstrate the effectiveness of our method. The code is available at https://github.com/mobiushy/move-act.
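The key/value querying described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the `start_step` parameter (set to 7 to mirror the S=7 mentioned in the TL;DR), the step indexing convention, and the single-head, unbatched layout are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # Scaled dot-product attention over (tokens, dim) arrays.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def edit_step_attention(q_edit, k_edit, v_edit, k_inv, v_inv, step, start_step=7):
    # Hypothetical helper: from `start_step` onward, the editing branch's
    # queries attend to keys/values saved from the inversion branch at the
    # same timestep; before that, plain self-attention is used.
    if step >= start_step:
        return self_attention(q_edit, k_inv, v_inv)
    return self_attention(q_edit, k_edit, v_edit)

# Toy usage with random features (4 tokens, 8 channels).
rng = np.random.default_rng(0)
q_e, k_e, v_e, k_i, v_i = (rng.normal(size=(4, 8)) for _ in range(5))
early = edit_step_attention(q_e, k_e, v_e, k_i, v_i, step=3)
late = edit_step_attention(q_e, k_e, v_e, k_i, v_i, step=7)
```

Because the output rows are weighted averages of the inversion-branch values, the edited image draws its appearance from the source image while its layout follows the editing branch's queries.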

Paper Structure (24 sections, 8 equations, 11 figures, 1 table)


Figures (11)

  • Figure 1: (a)-(d): our method allows users to control where the edited object is placed. (e)-(i): when altering only the action, our method captures the "standing" pose and keeps the background consistent with the input image.
  • Figure 2: Move&Act pipeline. In the inversion stage, we update latent code and transfer object information to the target area while repairing the source background. In the reverse stage, we use self-attention image features for consistent image editing.
  • Figure 3: We use the cross-attention map to locate the source area of the object, and a dilation operation to obtain the edge area around the source region.
  • Figure 4: Efficacy of the source-area inpainting loss $L_{sai}$. With $L_{sai}$, the object transfers fully to the target area; without it, the object transfers only partly.
  • Figure 5: Editing prompt: A standing cat. Efficacy of the background preservation loss $L_{bg}$. Without $L_{bg}$, background details such as the meadow become distorted, as with MasaCtrl. With $L_{bg}$, the background remains well preserved.
  • ...and 6 more figures
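
The localization step in the Figure 3 caption (cross-attention map to source area, dilation to edge area) can be sketched roughly as follows. This is an illustrative sketch only: the min-max normalization, the `threshold` value, the square kernel, and the function names are assumptions, not details taken from the paper.

```python
import numpy as np

def binary_dilate(mask, kernel_size=3):
    # Dilate a boolean mask with a square kernel (odd kernel_size).
    r = kernel_size // 2
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def source_and_edge(attn_map, threshold=0.5, kernel_size=3):
    # Assumed recipe: normalize the attention map, threshold it to get the
    # source mask, then take the dilated ring outside it as the edge area.
    a = attn_map.astype(float)
    a = (a - a.min()) / (a.max() - a.min() + 1e-8)
    source = a > threshold
    edge = binary_dilate(source, kernel_size) & ~source
    return source, edge

# Toy usage: a 5x5 map with a single strongly attended pixel in the center.
attn = np.zeros((5, 5))
attn[2, 2] = 1.0
source, edge = source_and_edge(attn)
```

A larger `kernel_size` widens the edge ring, which is one way to see why results can be sensitive to the dilation kernel size noted in the TL;DR.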