FreeViS: Training-free Video Stylization with Inconsistent References
Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel
TL;DR
FreeViS tackles training-free video stylization by embedding multiple stylized references into a pretrained image-to-video diffusion model. It introduces Indirect High-frequency Compensation to preserve layout while maintaining appearance, Isolated-Attn and dynamic references to enable multi-reference propagation, and Explicit Optical Flow Guidance to stabilize textures in plain regions. Extensive experiments show higher stylization fidelity and temporal consistency than state-of-the-art baselines in both video stylization and stylized text-to-video generation. The approach offers a practical, training-free way to produce high-quality, temporally coherent stylized videos at a reasonable computational cost once inversion overhead is taken into account.
Abstract
Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame by frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior work without introducing flicker or stutter. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. In extensive evaluations, FreeViS delivers higher stylization fidelity and better temporal consistency than recent baselines and is strongly preferred by human raters. Our training-free pipeline offers a practical and economical solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/
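The abstract does not say how the high-frequency signal is obtained, so the following is only a generic illustration of what "high-frequency content" of a frame means, not the FreeViS procedure. It is a minimal sketch that splits a grayscale frame into low- and high-frequency parts with an FFT low-pass mask. The function name `split_frequencies` and the `cutoff` parameter are hypothetical, chosen here for illustration.

```python
import numpy as np

def split_frequencies(frame: np.ndarray, cutoff: float = 0.1):
    """Split a 2-D grayscale frame into low- and high-frequency parts.

    `cutoff` (hypothetical parameter) is the radius of the circular
    low-pass mask, in cycles per pixel.
    """
    frame = np.asarray(frame, dtype=np.float64)
    h, w = frame.shape
    # Frequency coordinates laid out the same way as np.fft.fft2 output.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_pass = np.sqrt(fx**2 + fy**2) <= cutoff
    # Keep only the frequencies inside the mask, then go back to pixels.
    low = np.fft.ifft2(np.fft.fft2(frame) * low_pass).real
    # The residual holds edges and fine texture.
    high = frame - low
    return low, high

# Usage: a random frame is recovered exactly from its two parts.
rng = np.random.default_rng(0)
frame = rng.random((16, 24))
low, high = split_frequencies(frame)
assert np.allclose(low + high, frame)
```

In a decomposition like this, the high-frequency residual carries edges and fine structure, which is the kind of signal that can constrain layout, while the low-frequency part carries broad color and shading.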
