On mitigating stability-plasticity dilemma in CLIP-guided image morphing via geodesic distillation loss
Yeongtak Oh, Saehyung Lee, Uiwon Hwang, Sungroh Yoon
TL;DR
The paper tackles the stability-plasticity dilemma in CLIP-guided image morphing, showing that naive directional CLIP guidance often steers morphs away from the image manifold. It introduces a geodesic distillation loss with inter-modality (IMC) and intra-modality (IMR) regularizers, projected into a low-dimensional CLIP subspace, plus LPIPS, to enforce gradual, manifold-consistent morphing. Across StyleGAN-NADA and Text2Live, the approach yields more photorealistic morphs, better preservation of source attributes, and reduced hyperparameter sensitivity, even for out-of-domain prompts, while enabling CLIP inversion without pre-trained generators. By grounding the guidance on the geodesic path within CLIP's manifold, the method serves as a drop-in improvement to existing CLIP-guided morphing pipelines for image and video editing. The work also clarifies the role of manifold geometry in multi-modal morphing and offers a principled route to mitigating catastrophic-forgetting-like effects in continual-style morphing.
Abstract
Large-scale language-vision pre-training models, such as CLIP, have achieved remarkable text-guided image morphing results by leveraging several unconditional generative models. However, existing CLIP-guided image morphing methods encounter difficulties when morphing photorealistic images. Specifically, existing guidance fails to provide detailed explanations of the morphing regions within the image, leading to misguidance. In this paper, we observe that such misguidance can be effectively mitigated by simply using a proper regularization loss. Our approach comprises two key components: 1) a geodesic cosine similarity loss that minimizes inter-modality features (i.e., image and text) on a projected subspace of CLIP space, and 2) a latent regularization loss that minimizes intra-modality features (i.e., image and image) on the image manifold. Used as a drop-in replacement for the naïve directional CLIP loss, our method achieves superior morphing results on both images and videos across various benchmarks, including CLIP-inversion.
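To make the terminology above concrete, the following is a minimal sketch, not the paper's implementation. It contrasts the directional CLIP loss in the form used by StyleGAN-NADA, 1 - cos(ΔI, ΔT), with the geodesic (arc-length) distance between two unit-normalized embeddings, which on the unit hypersphere is the arccosine of their cosine similarity. The abstract does not give the exact form of the geodesic cosine similarity loss or of the subspace projection, so neither is reproduced here. All function names and the toy 2-D vectors are hypothetical stand-ins for real CLIP embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length (CLIP features are usually L2-normalized)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def geodesic_distance(a, b):
    """Arc length between two embeddings on the unit hypersphere, in radians."""
    # Clamp to [-1, 1] so floating-point round-off cannot break acos.
    c = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(c)


def directional_clip_loss(img_src, img_out, txt_src, txt_tgt):
    """Directional loss: 1 - cos(delta_image, delta_text).

    Assumption: embeddings are plain lists of floats standing in for
    CLIP image/text features.
    """
    d_img = [o - s for o, s in zip(img_out, img_src)]
    d_txt = [t - s for t, s in zip(txt_tgt, txt_src)]
    return 1.0 - cosine_similarity(d_img, d_txt)


if __name__ == "__main__":
    # Toy 2-D "embeddings": the image edit moves in the same direction as the text edit.
    img_src, img_out = [1.0, 0.0], [0.0, 1.0]
    txt_src, txt_tgt = [1.0, 0.0], [0.0, 1.0]
    print(directional_clip_loss(img_src, img_out, txt_src, txt_tgt))
    print(geodesic_distance(img_src, img_out))
```

The directional loss looks only at the direction of the edit in embedding space, whereas the geodesic distance measures how far two embeddings are apart along the sphere. A regularizer built on the latter can, in principle, penalize large jumps away from the source, which is the intuition behind gradual morphing.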
