SOYO: A Tuning-Free Approach for Video Style Morphing via Style-Adaptive Interpolation in Diffusion Models

Haoyu Zheng, Qifan Yu, Binghe Yu, Yang Dai, Wenqiao Zhang, Juncheng Li, Siliang Tang, Yueting Zhuang

TL;DR

SOYO is a tuning-free, diffusion-based framework for open-domain video style morphing that preserves structural content while smoothly transitioning between two style references. It combines cross-frame style fusion attention, Dual-Style Latent AdaIN, and Adaptive Style Distance Mapping to interpolate style features and color statistics over time without fine-tuning the pre-trained model. On the SOYO-Test benchmark, the method achieves better temporal coherence and structural preservation than existing methods across diverse scenes and artistic styles. Beyond the inversion and diffusion steps themselves, it adds little computational overhead, which makes it a practical option for multi-style video stylization.

Abstract

Diffusion models have achieved remarkable progress in image and video stylization. However, most existing methods focus on single-style transfer, while video stylization involving multiple styles necessitates seamless transitions between them. We refer to this smooth style transition between video frames as video style morphing. Current approaches often generate stylized video frames with discontinuous structures and abrupt style changes when handling such transitions. To address these limitations, we introduce SOYO, a novel diffusion-based framework for video style morphing. Our method employs a pre-trained text-to-image diffusion model without fine-tuning, combining attention injection and AdaIN to preserve structural consistency and enable smooth style transitions across video frames. Moreover, we notice that applying linear equidistant interpolation directly induces imbalanced style morphing. To harmonize across video frames, we propose a novel adaptive sampling scheduler operating between two style images. Extensive experiments demonstrate that SOYO outperforms existing methods in open-domain video style morphing, better preserving the structural coherence of video frames while achieving stable and smooth style transitions.
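
To make the AdaIN component concrete, the following is a minimal NumPy sketch, not the authors' implementation. It follows the pipeline caption (Figure 3): the two style latents are linearly interpolated, and the content latent is then re-normalized to the per-channel mean and standard deviation of the blended style latent. The function names, the `(C, H, W)` latent layout, the `eps` constant, and the scalar weight `alpha` are assumptions made for illustration.

```python
import numpy as np

def channel_stats(z, eps=1e-5):
    """Per-channel mean and std of a latent with assumed shape (C, H, W)."""
    mean = z.mean(axis=(1, 2), keepdims=True)
    std = z.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    return mean, std

def dual_style_latent_adain(z_c, z_s0, z_s1, alpha):
    """Modulate a content latent with a blend of two style latents.

    alpha = 0 uses only the first style, alpha = 1 only the second.
    Hypothetical helper: the blending weight schedule is not modeled here.
    """
    # Linear interpolation of the two style latents.
    z_s = (1.0 - alpha) * z_s0 + alpha * z_s1
    mu_c, sigma_c = channel_stats(z_c)
    mu_s, sigma_s = channel_stats(z_s)
    # Standard AdaIN: whiten the content statistics, then apply style statistics.
    return sigma_s * (z_c - mu_c) / sigma_c + mu_s
```

Calling `dual_style_latent_adain(z_c, z_s0, z_s1, alpha)` once per frame with a frame-dependent `alpha` would shift the color statistics gradually from the first style to the second. How `alpha` is chosen per frame is the job of the adaptive scheduler mentioned in the abstract, which this sketch does not attempt to reproduce.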

Paper Structure

This paper contains 16 sections, 9 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Video style morphing results generated by SOYO.
  • Figure 2: In the context of video stylization tasks involving style transitions, abrupt style switching can result in visually disjointed effects. In contrast, a smooth and gradual style transition facilitates a visually natural transformation between styles.
  • Figure 3: SOYO pipeline. We perform DDIM Inversion on the source video $I^{[1:N]}$ to obtain $z^c_T$, and execute DDIM Inversion on two style images $I^{s0}$ and $I^{s1}$ to get $z^{s0}_T$ and $z^{s1}_T$, respectively, while saving their attention values. Subsequently, we perform linear interpolation on $z^{s0}_t$ and $z^{s1}_t$ with specific weights to obtain $z^s_t$, which is then used to modulate $z^c_t$ via AdaIN. During the denoising process, the latents corresponding to each frame are injected with interpolated $K^s$ and $V^s$ values from the style images, while receiving $Q^c$ injections from the source video frames. This results in smoothly transitioned stylized video frames $I'^{[1:N]}$.
  • Figure 4: The injection of keys and values derived from a single style image results in stylized video frames that fail to accurately represent the desired textures and transitions.
  • Figure 5: Qualitative comparisons of SOYO with existing methods.
  • ...and 3 more figures
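
The attention injection described in the Figure 3 caption can be illustrated with a small NumPy sketch, written under stated assumptions rather than taken from the paper's code. Queries come from the content frame ($Q^c$), while keys and values are a linear blend of those saved for the two style images ($K^s$, $V^s$). The function names, the single-head `(tokens, dim)` layout, and the linearly spaced per-frame weights are hypothetical. The abstract notes that linear equidistant weights produce imbalanced morphing, so `linear_frame_weights` stands in only as the baseline that the adaptive scheduler replaces.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def style_fusion_attention(q_c, k_s0, v_s0, k_s1, v_s1, alpha):
    """Single-head attention with content queries and blended style keys/values.

    q_c: (N_q, d) queries from one content frame.
    k_s*, v_s*: (N_k, d) keys/values saved from the two style images.
    alpha: blending weight in [0, 1] for this frame (assumed scalar).
    """
    k_s = (1.0 - alpha) * k_s0 + alpha * k_s1
    v_s = (1.0 - alpha) * v_s0 + alpha * v_s1
    d = q_c.shape[-1]
    attn = softmax(q_c @ k_s.T / np.sqrt(d), axis=-1)  # scaled dot-product
    return attn @ v_s

def linear_frame_weights(n_frames):
    """Baseline: equidistant weights from 0 (first style) to 1 (second style)."""
    return np.linspace(0.0, 1.0, n_frames)
```

In a per-frame loop, frame `i` would call `style_fusion_attention` with `alpha = weights[i]`, so early frames attend mostly to the first style and late frames to the second. Cross-frame interactions and the choice of which layers and timesteps receive the injection are left out of this sketch.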