Follow-Your-Shape: Shape-Aware Image Editing via Trajectory-Guided Region Control

Zeqian Long, Mingzhe Zheng, Kunyu Feng, Xinhua Zhang, Hongyu Liu, Harry Yang, Linfeng Zhang, Qifeng Chen, Yue Ma

TL;DR

Follow-Your-Shape tackles large-scale, shape-aware image editing with a training-free, mask-free approach. It introduces a Trajectory Divergence Map (TDM) that quantifies token-wise velocity differences between the inversion and editing trajectories, and uses a three-stage editing process with scheduled KV injection and ControlNet conditioning to localize edits and preserve the background. The key contributions are TDM-based region localization, a staged editing strategy that stabilizes trajectories, and the ReShapeBench benchmark for evaluating shape transformations. On ReShapeBench, the method reports state-of-the-art background preservation and text alignment: PSNR $= 35.79$, LPIPS $= 8.23\times 10^{-3}$, CLIP-Sim $= 33.71$, and Aesthetic Score $= 6.57$. Because it requires neither training nor masks, the framework is practical for precise content modification in real-world applications, and the benchmark gives future shape-aware editing research a standardized point of comparison.

Abstract

While recent flow-based image editing models demonstrate general-purpose capabilities across diverse tasks, they often struggle to specialize in challenging scenarios, particularly those involving large-scale shape transformations. When performing such structural edits, these methods either fail to achieve the intended shape change or inadvertently alter non-target regions, resulting in degraded background quality. We propose Follow-Your-Shape, a training-free and mask-free framework that supports precise and controllable editing of object shapes while strictly preserving non-target content. Motivated by the divergence between inversion and editing trajectories, we compute a Trajectory Divergence Map (TDM) by comparing token-wise velocity differences between the inversion and denoising paths. The TDM enables precise localization of editable regions and guides a Scheduled KV Injection mechanism that ensures stable and faithful editing. To facilitate a rigorous evaluation, we introduce ReShapeBench, a new benchmark comprising 120 new images and enriched prompt pairs specifically curated for shape-aware editing. Experiments demonstrate that our method achieves superior editability and visual fidelity, particularly in tasks requiring large-scale shape replacement.
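To make the TDM idea concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes the velocities predicted along the inversion and editing paths have already been stacked into arrays of shape `(T, N, D)` (timesteps, tokens, channels). The function name, the L2 distance, the mean over timesteps, and the min-max normalization are all illustrative assumptions, since the abstract only states that token-wise velocity differences are compared.

```python
import numpy as np

def trajectory_divergence_map(v_inv, v_edit, eps=1e-8):
    """Hypothetical helper: per-token divergence between two velocity stacks.

    v_inv, v_edit: arrays of shape (T, N, D), i.e. T timesteps, N tokens,
    D channels. Returns an array of shape (N,) scaled to [0, 1].
    """
    v_inv = np.asarray(v_inv, dtype=np.float64)
    v_edit = np.asarray(v_edit, dtype=np.float64)
    if v_inv.shape != v_edit.shape:
        raise ValueError("velocity stacks must have the same shape")

    # L2 distance between the two velocity fields, per timestep and token.
    per_step = np.linalg.norm(v_edit - v_inv, axis=-1)  # shape (T, N)

    # Assumption: aggregate over timesteps with a plain mean.
    agg = per_step.mean(axis=0)  # shape (N,)

    # Assumption: min-max normalize so the map lies in [0, 1].
    lo, hi = agg.min(), agg.max()
    return (agg - lo) / (hi - lo + eps)

# Toy usage: only the first 4 of 16 tokens diverge between the two paths.
rng = np.random.default_rng(0)
v_inv = rng.normal(size=(4, 16, 8))
v_edit = v_inv.copy()
v_edit[:, :4, :] += 1.0
tdm = trajectory_divergence_map(v_inv, v_edit)
```

In this toy case the map is close to 1 on the four perturbed tokens and 0 elsewhere, which is the behavior one would want from a region-localization signal.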

Paper Structure

This paper contains 38 sections, 17 equations, 13 figures, 6 tables, 1 algorithm.

Figures (13)

  • Figure 1: We propose Follow-Your-Shape, a training- and mask-free image editing framework that excels at prompt-driven shape transformation. Our method enables flexible modification of arbitrary object shapes while strictly maintaining non-target content. The examples demonstrate both single-object and multi-object cases involving significant shape transformation.
  • Figure 2: Motivation for Trajectory Divergence Map (TDM) Guided Editing. Top: Vanilla editing methods (red) often produce unstable trajectories compared to the stable reconstruction path (orange), leading to distorted outputs. Middle: Our staged approach first stabilizes the editing trajectory before using the TDM to guide it toward the target concept. This method supports diverse shape modifications (dashed lines). Bottom: The TDM visualizes the dynamically localized editing region across different timesteps, with different border colors corresponding to different stages.
  • Figure 3: Overview of our proposed pipeline. Given a source image and the corresponding prompt, we first perform inversion to obtain the initial noisy latent code $x_T$. The editing process is then divided into three stages. In Stage 1, we stabilize the initial denoising trajectory by injecting key-value (KV) features from the inversion path into the denoising model during its initial steps. In Stage 2, we compute a Trajectory Divergence Map (TDM) by comparing the denoising trajectories generated from the source and edit prompts. This map is then processed to precisely identify the regions intended for editing. In Stage 3, guided by the TDM, blended KV features are injected into the final attention blocks of the denoising model to introduce the new semantics. Simultaneously, ControlNet conditions are supplied to ensure the edited regions conform to the original structure.
  • Figure 4: Distribution of values of $\tilde{M}_S$. The red dashed line indicates the Otsu threshold $\tau$.
  • Figure 5: Construction Process of ReShapeBench. Note that images in Step 3 are generated after benchmark construction to serve as visual references. Checklist validation is performed on the prompts.
  • ...and 8 more figures
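
The thresholding in the Figure 4 caption and the blended KV injection in the Figure 3 caption can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's code: `otsu_threshold` is a plain NumPy version of Otsu's method applied to a flattened map standing in for $\tilde{M}_S$, and `blend_kv` is a hypothetical helper that assumes the blend is a per-token convex combination of edit-path and inversion-path features. The captions do not give the exact blending rule.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's method: pick the cut that maximizes between-class variance."""
    values = np.asarray(values, dtype=np.float64).ravel()
    hist, edges = np.histogram(values, bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weights = hist / hist.sum()

    w0 = np.cumsum(weights)               # mass of the low class up to each bin
    w1 = 1.0 - w0                         # mass of the high class
    mu_cum = np.cumsum(weights * centers)
    mu_total = mu_cum[-1]

    # Between-class variance; skip cuts that leave one class empty.
    denom = w0 * w1
    valid = denom > 1e-12
    sigma_b = np.zeros_like(denom)
    sigma_b[valid] = (mu_total * w0[valid] - mu_cum[valid]) ** 2 / denom[valid]

    k = int(np.argmax(sigma_b))
    return edges[k + 1]                   # upper edge of the last low-class bin

def blend_kv(kv_inv, kv_edit, mask):
    """Assumed blend: edit-path features inside the mask, inversion-path outside.

    kv_inv, kv_edit: arrays of shape (N, D); mask: boolean array of shape (N,).
    """
    m = np.asarray(mask, dtype=np.float64)[:, None]
    return m * kv_edit + (1.0 - m) * kv_inv

# Toy usage: a bimodal map with 50 low and 50 high values.
scores = np.concatenate([np.linspace(0.05, 0.15, 50),
                         np.linspace(0.85, 0.95, 50)])
tau = otsu_threshold(scores)
edit_mask = scores > tau                  # True where editing is allowed

kv_inv = np.zeros((100, 4))
kv_edit = np.ones((100, 4))
kv_blend = blend_kv(kv_inv, kv_edit, edit_mask)
```

A hard binary mask is used here for simplicity; a soft mask with values in $[0, 1]$ would work with the same `blend_kv` expression.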