Continuous Control of Editing Models via Adaptive-Origin Guidance

Alon Wolf, Chen Katzir, Kfir Aberman, Or Patashnik

TL;DR

This work tackles the challenge of smoothly controlling edit strength in diffusion-based editing by addressing why standard CFG cannot provide gradual transitions. It introduces Adaptive-Origin Guidance (AdaOr), which interpolates between an identity-origin and the standard unconditional origin via a learned $\langle\texttt{id}\rangle$ instruction and a schedule $s(\alpha)$, enabling continuous, semantically faithful edits for both images and videos. AdaOr achieves smoother transitions, better text alignment, and more consistent perceptual trajectories across multiple baselines, without requiring per-edit data or specialized datasets. The approach preserves input content at low strengths and mirrors standard editing behavior at high strengths, with empirical validation through qualitative, quantitative, and user-study evaluations. Overall, AdaOr represents a practical, generalizable mechanism to harness edit strength control in diffusion-based editing tasks.

Abstract

Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) governs prompt adherence, which suggests it as a potential control for edit intensity in editing models. However, we show that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin and dominates the generation at low guidance scales, while representing an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that adjusts this standard guidance origin using an identity-conditioned adaptive origin, obtained from an identity instruction that corresponds to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control than current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without a per-edit procedure or reliance on specialized datasets.
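
The origin interpolation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear form of the schedule $s(\alpha)$, and the choice to scale the guidance term by the edit strength $\alpha$ are all assumptions made for illustration. The sketch only aims to reproduce the two endpoint behaviors stated above: the identity prediction at strength 0 and standard CFG at strength 1.

```python
# Hypothetical sketch of adaptive-origin guidance for one denoising step.
# Inputs are the three model predictions at the current timestep:
#   eps_cond : prediction conditioned on the edit instruction c_T
#   eps_null : unconditional (null) prediction
#   eps_id   : prediction conditioned on the identity instruction <id>
# Values may be floats or array-like objects supporting + and *.

def linear_schedule(alpha):
    """Assumed schedule s(alpha): a clamped linear ramp on [0, 1]."""
    return min(max(alpha, 0.0), 1.0)


def standard_cfg(eps_cond, eps_null, w):
    """Standard CFG: start at the null origin, step along eps_cond - eps_null."""
    return eps_null + w * (eps_cond - eps_null)


def adaptive_origin_guidance(eps_cond, eps_null, eps_id, alpha, w,
                             schedule=linear_schedule):
    """Interpolate the guidance origin between eps_id and eps_null.

    alpha is the edit strength in [0, 1]; w is the usual guidance scale.
    Scaling the guidance term by alpha is an assumption of this sketch.
    """
    s = schedule(alpha)
    origin = (1.0 - s) * eps_id + s * eps_null
    return origin + alpha * w * (eps_cond - eps_null)
```

Under these assumptions, `adaptive_origin_guidance(..., alpha=0.0, ...)` returns `eps_id` unchanged, so the step reconstructs the input, while `alpha=1.0` gives the same result as `standard_cfg`. Intermediate strengths move the origin and the guidance term together, which is the property the abstract credits for the smooth transition.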

Paper Structure

This paper contains 28 sections, 9 equations, 18 figures, 2 tables.

Figures (18)

  • Figure 1: Comparison with standard CFG scaling. We compare the progression of the edit as the intensity increases from 0 (left) to 1 (right). Standard CFG (top) exhibits arbitrary modifications at lower scales (e.g., the gold face paint). In contrast, our method (bottom) enables a smooth, continuous transition, gradually introducing the mask features while maintaining the structural integrity of the face throughout the interpolation.
  • Figure 2: Geometric interpretation of standard Classifier-Free Guidance and Adaptive-Origin Guidance. In both (a) and (b), we illustrate a single denoising step in which the latent $\mathbf{z}_t$ lies on the manifold of the marginal distribution $p_t$. The origin prediction first denoises $\mathbf{z}_t$ toward the manifold of the less noisy distribution $p_{t-1}$, after which the trajectory is steered on this manifold toward better alignment with the conditioning signal. In (a) standard CFG, the origin is given by $\epsilon_t(\varnothing)$, and the guidance direction is $\epsilon_t(c_T) - \epsilon_t(\varnothing)$. In (b) Adaptive-Origin Guidance (ours), the origin is interpolated between the identity prediction $\epsilon_t(\langle\texttt{id}\rangle)$ and the standard null prediction $\epsilon_t(\varnothing)$, as a function of the edit strength. This ensures faithful reconstruction at low strengths while smoothly recovering standard CFG behavior at higher strengths. In (c), we show the edit progression as a function of edit strength. While standard CFG originates from the unconditional prediction (representing arbitrary edits), Adaptive-Origin Guidance originates from the identity prediction, creating a trajectory that smoothly transitions from the input image to the target edit.
  • Figure 3: Qualitative comparison. We compare our method (bottom) against Kontinuous Kontext [parihar2025kontinuous] (top) and FreeMorph [cao2025freemorph] (middle). Kontinuous Kontext suffers from semantic entanglement, removing the rain and altering the woman's expression as the edit strength increases. FreeMorph fails to generate plausible intermediate states, producing severe artifacts (e.g., the distorted sleeve and hand at strength 0.2). In contrast, AdaOr (Ours) produces a smooth, linear transition that effectively applies the edit while strictly preserving the input content and the subject's appearance.
  • Figure 4: Qualitative comparison. We compare our method (bottom) against Concept Sliders (top) and SAEdit (middle). Concept Sliders modifies the man's identity, while SAEdit introduces only weak curl patterns. AdaOr (Ours) produces a smooth transition that effectively applies the edit while preserving the input content and the subject's identity.
  • Figure 5: User study results. We report the win rate of our method compared to three baselines: Kontinuous Kontext, Lucy-FreeMorph, and Qwen-FreeMorph. Human evaluators assessed three aspects: smoothness of the transition, intermediate quality, and overall preference.
  • ...and 13 more figures