
On mitigating stability-plasticity dilemma in CLIP-guided image morphing via geodesic distillation loss

Yeongtak Oh, Saehyung Lee, Uiwon Hwang, Sungroh Yoon

TL;DR

The paper tackles the stability-plasticity dilemma in CLIP-guided image morphing, showing that naive directional CLIP guidance often steers morphs away from the image manifold. It introduces a geodesic distillation loss with inter-modality (IMC) and intra-modality (IMR) regularizers, computed in a low-dimensional projected subspace of CLIP space and combined with LPIPS, to enforce gradual, manifold-consistent morphing. Applied to StyleGAN-NADA and Text2Live, the approach yields more photorealistic morphs, better preservation of source attributes, and reduced hyperparameter sensitivity, even for out-of-domain prompts, and it enables CLIP inversion without pre-trained generators. Because the guidance follows the geodesic path on CLIP's manifold, the method works as a drop-in replacement for the directional CLIP loss in existing CLIP-guided morphing pipelines for image and video editing. The work also clarifies the role of manifold geometry in multi-modal morphing and offers a principled route to mitigating catastrophic-forgetting-like effects in continual-style morphing.

Abstract

Large-scale language-vision pre-training models, such as CLIP, have achieved remarkable text-guided image morphing results by leveraging several unconditional generative models. However, existing CLIP-guided image morphing methods encounter difficulties when morphing photorealistic images. Specifically, existing guidance fails to provide detailed explanations of the morphing regions within the image, leading to misguidance. In this paper, we observed that such misguidance could be effectively mitigated by simply using a proper regularization loss. Our approach comprises two key components: 1) a geodesic cosine similarity loss that minimizes inter-modality features (i.e., image and text) on a projected subspace of CLIP space, and 2) a latent regularization loss that minimizes intra-modality features (i.e., image and image) on the image manifold. By replacing the naïve directional CLIP loss in a drop-in replacement manner, our method achieves superior morphing results on both images and videos for various benchmarks, including CLIP-inversion.
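To make the contrast in the abstract concrete, the sketch below compares the directional CLIP loss used by baselines such as StyleGAN-NADA (one minus the cosine similarity between the image-feature shift and the text-feature shift) with a plain geodesic (arc-length) distance between L2-normalized features on the unit hypersphere. It is only an illustration under stated assumptions: the function names are hypothetical, toy lists stand in for real CLIP embeddings, and the paper's actual loss additionally uses a projected subspace and regularizers that are not reproduced here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so it lies on the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity, clamped to [-1, 1] to guard against rounding."""
    a, b = l2_normalize(a), l2_normalize(b)
    dot = sum(x * y for x, y in zip(a, b))
    return max(-1.0, min(1.0, dot))


def directional_clip_loss(img_src, img_tgt, txt_src, txt_tgt):
    """Baseline-style directional loss: 1 - cos(delta_image, delta_text).

    Inputs are assumed to be CLIP features of the source/morphed images
    and of the source/target prompts (toy vectors in this sketch).
    """
    d_img = [t - s for s, t in zip(img_src, img_tgt)]
    d_txt = [t - s for s, t in zip(txt_src, txt_tgt)]
    return 1.0 - cosine(d_img, d_txt)


def geodesic_distance(a, b):
    """Arc length between two features after L2 normalization.

    This is the standard great-circle distance on the unit sphere; it is
    an assumption that it matches the spirit of the paper's geodesic
    term, not a reproduction of the paper's exact formula.
    """
    return math.acos(cosine(a, b))
```

The directional loss only compares straight-line difference vectors, whereas the geodesic distance measures movement along the sphere on which the normalized features live, which is the geometric intuition behind guiding the morph along the manifold rather than across it.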

Paper Structure (22 sections, 8 equations, 10 figures, 1 table)


Figures (10)

  • Figure 1: The visualization represents the CLIP space, where image and text features are $L_{2}$-normalized, illustrating an example of morphing from 'human' to 'hulk'. In CLIP-guided image morphing, $Z^{I}_{s}$ continuously transforms into $Z^{I}_{t}$ by following the text guidance of $Z^{T}_{s}$ to $Z^{T}_{t}$. Here, $Z^{I}$ and $Z^{T}$ denote image and text features, respectively. In our proposed method, the feature of a morphed image is represented by $Z^{I}_{t,1}$, whereas the baseline method employs $Z^{I}_{t,2}$. Specifically, our approach guides the morphing process along the image manifold, resulting in more photorealistic morphed images.
  • Figure 2: Results of CLIP-guided image morphing. The original images are generated by a StyleGAN pre-trained on the FFHQ dataset. The first row shows the results of the baseline method, and the second row shows the results of the proposed method.
  • Figure 3: Dimension study for selecting the optimal subspace dimension.
  • Figure 4: Continuous image metamorphosis over training iterations for the prompts 'hulk', 'superman', and 'special forces' with (a) the baseline and (b) our proposed method.
  • Figure 5: Visualization of CLIP scores. (a) denotes the extent of image morphing from source images, and (b) denotes the extent of image morphing towards the target image manifold. Our method consistently outperforms the baseline for all of the given prompts and each training iteration.
  • ...and 5 more figures