UniVST: A Unified Framework for Training-free Localized Video Style Transfer

Quanjian Song, Mingbao Lin, Wengyi Zhan, Shuicheng Yan, Liujuan Cao, Rongrong Ji

TL;DR

UniVST performs localized video style transfer without any training by combining three components: point-matching mask propagation, which derives per-frame masks from DDIM inversion features; training-free AdaIN-guided localized stylization, which acts at both the latent and attention levels; and a sliding-window smoothing scheme that uses optical flow to improve temporal coherence. The approach stylizes the foreground precisely while preserving content fidelity and reducing flicker, outperforming state-of-the-art baselines on two DAVTG datasets across multiple backbones. Inversion and smoothing add computational overhead, but UniVST generalizes well and remains practical in a training-free regime. The work moves diffusion-based video editing toward fine-grained, temporally coherent localized stylization.

Abstract

This paper presents UniVST, a unified framework for localized video style transfer based on diffusion models. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos. The contributions of this paper are: (1) A point-matching mask propagation strategy that leverages the feature maps from the DDIM inversion. This streamlines the model's architecture by obviating the need for tracking models. (2) A training-free AdaIN-guided localized video stylization mechanism that operates at both the latent and attention levels. This balances content fidelity and style richness, mitigating the loss of localized details commonly associated with direct video stylization. (3) A sliding-window consistent smoothing scheme that harnesses optical flow within the pixel representation and refines predicted noise to update the latent space. This significantly enhances temporal consistency and diminishes artifacts in stylized video. Our proposed UniVST outperforms existing methods in both quantitative and qualitative evaluations. It adeptly addresses the challenges of preserving the primary object's style while ensuring temporal consistency and detail preservation. Our code is available at https://github.com/QuanjianSong/UniVST.
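The AdaIN operation and the localized (mask-restricted) stylization mentioned in the abstract can be illustrated with a minimal sketch. It shows only the generic AdaIN statistic matching, followed by a mask-based blend, on flat lists of numbers. It is not the UniVST implementation: the function names `adain` and `blend_localized`, the `eps` constant, and the toy values are assumptions made for illustration, and a real pipeline would apply this per channel to latent tensors.

```python
from statistics import fmean, pstdev


def adain(content, style, eps=1e-5):
    """Generic AdaIN: rescale `content` so its mean/std match those of `style`.

    Hypothetical helper for illustration; operates on a flat list of floats.
    """
    mu_c, sigma_c = fmean(content), pstdev(content)
    mu_s, sigma_s = fmean(style), pstdev(style)
    # Normalize the content values, then apply the style statistics.
    return [sigma_s * (x - mu_c) / (sigma_c + eps) + mu_s for x in content]


def blend_localized(stylized, original, mask):
    """Keep stylized values where mask == 1 and original values where mask == 0."""
    return [m * s + (1 - m) * o for s, o, m in zip(stylized, original, mask)]


# Toy data (assumed): four "latent" values, with the first two inside the mask.
content = [0.0, 1.0, 2.0, 3.0]
style = [10.0, 10.0, 14.0, 14.0]
mask = [1, 1, 0, 0]

stylized = adain(content, style)
blended = blend_localized(stylized, content, mask)
print(blended)
```

In this sketch the masked positions take on the style statistics while the unmasked positions keep their original values, which is the sense in which the stylization is localized.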

Paper Structure

This paper contains 21 sections, 14 equations, 18 figures, 8 tables, 1 algorithm.

Figures (18)

  • Figure 1: Existing methods suffer from (a) lack of fine-grained control, (b) imbalance between content fidelity and style richness, and (c) temporal inconsistency, whereas our UniVST effectively overcomes these challenges. The last row visualizes the coherence across frames using MimicMotion's Y-T slices for both BIVDiff and our UniVST.
  • Figure 2: Overall framework of our UniVST. It consists of three core components: (1) Point-Matching Mask Propagation, (2) AdaIN-Guided Localized Video Stylization (Localized Latent Blending, AdaIN-Guided Latent-Shift, and Attention-Shift), and (3) Sliding-Window Consistent Smoothing.
  • Figure 3: Comparison between our mask propagation strategy and the traditional SAM model in terms of memory usage and inference time.
  • Figure 4: Comparison of accuracy and inference time under different mask propagation strategies. Accuracy is measured by two standard segmentation metrics, intersection over union (IoU) and the Dice coefficient (Dice), which assess the overlap between predicted and ground-truth regions. Our strategy effectively balances inference time and accuracy.
  • Figure 5: Comparison of training-free stylization methods: (a) original video frame, (b) key-value replacement (StyleID), (c) key-value AdaIN, and (d) AdaIN-guided attention-shift. Our approach mitigates local detail loss.
  • ...and 13 more figures
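
The two overlap metrics named in the Figure 4 caption, IoU and Dice, follow standard definitions. The sketch below computes them for a pair of small binary masks. The function name `iou_and_dice`, the toy masks, and the convention of returning 1.0 when both masks are empty are assumptions for illustration, not taken from the paper's evaluation code.

```python
def iou_and_dice(pred, target):
    """Return (IoU, Dice) for two binary masks given as nested lists of 0/1."""
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")

    inter = sum(1 for a, b in zip(p, t) if a and b)  # pixels set in both masks
    p_sum = sum(1 for a in p if a)                   # predicted foreground size
    t_sum = sum(1 for b in t if b)                   # ground-truth foreground size
    union = p_sum + t_sum - inter

    # Assumed convention: two empty masks count as a perfect match.
    iou = inter / union if union else 1.0
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    return iou, dice


# Toy masks (assumed): 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]

iou, dice = iou_and_dice(pred, target)
print(f"IoU={iou:.3f}, Dice={dice:.3f}")  # IoU=0.500, Dice=0.667
```

For a single pair of masks, Dice is never lower than IoU (Dice = 2·IoU / (1 + IoU)), which helps when reading the two metrics side by side.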