Continuous Control of Editing Models via Adaptive-Origin Guidance
Alon Wolf, Chen Katzir, Kfir Aberman, Or Patashnik
TL;DR
This work tackles the challenge of smoothly controlling edit strength in diffusion-based editing and explains why scaling standard Classifier-Free Guidance (CFG) does not yield gradual transitions. It introduces Adaptive-Origin Guidance (AdaOr), which interpolates between an identity origin and the standard unconditional origin via a learned $\langle\texttt{id}\rangle$ instruction and a schedule $s(\alpha)$, enabling continuous, semantically faithful edits for both images and videos. Compared with multiple baselines, AdaOr achieves smoother transitions, better text alignment, and more consistent perceptual trajectories, without requiring per-edit data or specialized datasets. The approach preserves input content at low strengths and mirrors standard editing behavior at high strengths, and is validated through qualitative, quantitative, and user-study evaluations. Overall, AdaOr is a practical, generalizable mechanism for controlling edit strength in diffusion-based editing tasks.
Abstract
Diffusion-based editing models have emerged as a powerful tool for semantic image and video manipulation. However, existing models lack a mechanism for smoothly controlling the intensity of text-guided edits. In standard text-conditioned generation, Classifier-Free Guidance (CFG) governs prompt adherence, which suggests it as a potential control for edit intensity in editing models. We show, however, that scaling CFG in these models does not produce a smooth transition between the input and the edited result. We attribute this behavior to the unconditional prediction, which serves as the guidance origin: it dominates the generation at low guidance scales, yet it represents an arbitrary manipulation of the input content. To enable continuous control, we introduce Adaptive-Origin Guidance (AdaOr), a method that replaces this standard guidance origin with an identity-conditioned adaptive origin, using an identity instruction that corresponds to the identity manipulation. By interpolating this identity prediction with the standard unconditional prediction according to the edit strength, we ensure a continuous transition from the input to the edited result. We evaluate our method on image and video editing tasks, demonstrating that it provides smoother and more consistent control than current slider-based editing approaches. Our method incorporates an identity instruction into the standard training framework, enabling fine-grained control at inference time without a per-edit procedure or reliance on specialized datasets.
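To make the interpolation described above concrete, the following is a minimal sketch of one guidance step, not the authors' implementation. It assumes three model predictions are already available: one conditioned on the edit instruction, one conditioned on the identity instruction, and the standard unconditional one. The function names, the linear form of the schedule, and the choice to scale the guidance term by the edit strength are all illustrative assumptions; the abstract does not specify them.

```python
def linear_schedule(alpha):
    """Hypothetical schedule s(alpha): 0 at alpha = 0, 1 at alpha = 1.

    The source only names a schedule s(alpha); the linear form is an assumption.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha


def adaptive_origin_guidance(pred_edit, pred_id, pred_uncond,
                             alpha, guidance_scale,
                             schedule=linear_schedule):
    """Sketch of guidance around an adaptive origin.

    pred_edit   -- prediction conditioned on the edit instruction
    pred_id     -- prediction conditioned on the identity instruction
    pred_uncond -- standard unconditional prediction
    alpha       -- edit strength in [0, 1]

    Inputs may be floats or arrays that support elementwise arithmetic.
    """
    s = schedule(alpha)
    # Adaptive origin: identity prediction at low strength,
    # unconditional prediction at full strength.
    origin = (1.0 - s) * pred_id + s * pred_uncond
    # Assumption: the guidance term is scaled by alpha so that alpha = 0
    # returns the identity prediction and alpha = 1 recovers standard CFG.
    return origin + alpha * guidance_scale * (pred_edit - origin)


if __name__ == "__main__":
    # Toy scalar "predictions" chosen only to show the endpoint behavior.
    for a in (0.0, 0.5, 1.0):
        print(a, adaptive_origin_guidance(2.0, 1.0, 0.0, a, guidance_scale=4.0))
```

Under these assumptions, an edit strength of 0 returns the identity prediction unchanged, and an edit strength of 1 reduces to the usual CFG update around the unconditional prediction, which matches the endpoint behavior described in the summary.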
