FreeViS: Training-free Video Stylization with Inconsistent References

Jiacong Xu, Yiqun Mei, Ke Zhang, Vishal M. Patel

TL;DR

FreeViS addresses training-free video stylization by embedding multiple stylized references into a pretrained image-to-video diffusion model. It introduces Indirect High-frequency Compensation to preserve the content layout while maintaining the stylized appearance, Isolated-Attn with dynamic references to enable multi-reference propagation, and Explicit Optical Flow Guidance to stabilize textures in plain regions. Experiments show higher stylization fidelity and temporal consistency than state-of-the-art baselines in both video stylization and stylized text-to-video generation. The approach offers a practical, training-free route to high-quality, temporally coherent stylized videos, at a reasonable computational cost given the inversion overhead.

Abstract

Video stylization plays a key role in content creation, but it remains a challenging problem. Naïvely applying image stylization frame by frame hurts temporal consistency and reduces style richness. Alternatively, training a dedicated video stylization model typically requires paired video data and is computationally expensive. In this paper, we propose FreeViS, a training-free video stylization framework that generates stylized videos with rich style details and strong temporal coherence. Our method integrates multiple stylized references into a pretrained image-to-video (I2V) model, effectively mitigating the propagation errors observed in prior works without introducing flicker or stutter. In addition, it leverages high-frequency compensation to constrain the content layout and motion, together with flow-based motion cues to preserve style textures in low-saliency regions. In extensive evaluations, FreeViS delivers higher stylization fidelity and superior temporal consistency, outperforming recent baselines and achieving strong human preference. Our training-free pipeline offers a practical and economical solution for high-quality, temporally coherent video stylization. The code and videos can be accessed via https://xujiacong.github.io/FreeViS/

Paper Structure

This paper contains 34 sections, 24 equations, 19 figures, 2 tables.

Figures (19)

  • Figure 1: Previous works (e.g., AnyV2V [ku2024anyv2v]) suffer from propagation errors inherent to their single-reference inputs. Combined with a text-to-video model [wan2025wan], FreeViS outperforms existing methods [ye2025stylemaster, liu2024stylecrafter] on stylized video generation.
  • Figure 2: Visualization of cross-frame temporal (upper) and spatial (lower) attention at different timesteps.
  • Figure 3: Overview of the FreeViS pipeline. Isolated-Attn indicates the ① mode, while MIsolated-Attn includes ② and ③ attention modes. Optical flow, extracted using RAFT [teed2020raft], generates reference and flow masks for masked attention in attention modes ② and ③.
  • Figure 4: Qualitative comparison of FreeViS with other video editing methods on video stylization. The areas inside the bounding boxes show missing style textures and incorrect reconstruction.
  • Figure 5: Qualitative comparison of FreeViS with previous video and image stylization methods. The flickering of the image stylization methods can be observed in the supplementary video.
  • ...and 14 more figures