PickStyle: Video-to-Video Style Transfer with Context-Style Adapters

Soroush Mehraban, Vida Adeli, Jacob Rommann, Babak Taati, Kyryl Truskovskyi

TL;DR

PickStyle tackles video-to-video style transfer under limited paired video data by augmenting a pretrained video diffusion backbone with lightweight context-style adapters trained on image-pair supervision and synthetic motion. It introduces Context-Style Classifier-Free Guidance (CS-CFG) to steer style and content separately during denoising, and a noise initialization strategy that starts from a partially noised version of the input video to preserve motion priors. The method produces temporally coherent, style-faithful translations across diverse styles and outperforms existing baselines both quantitatively and qualitatively. Because supervision comes from paired images rather than paired videos, the approach lowers the data requirements for controllable video stylization.
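
The noise initialization idea can be illustrated with a minimal sketch. It assumes a rectified-flow-style linear interpolation between the clean input and Gaussian noise. The interpolation rule, the `strength` parameter, and the function name `partially_noise` are illustrative assumptions, not details stated in this summary.

```python
import random

def partially_noise(latent, strength, rng=None):
    """Blend a clean latent with Gaussian noise.

    strength=0.0 returns the input values; strength=1.0 returns pure noise.
    Assumes x_t = (1 - t) * x_0 + t * eps, a hypothetical choice; the real
    schedule depends on the diffusion backbone.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    if rng is None:
        rng = random.Random()
    return [(1.0 - strength) * x + strength * rng.gauss(0.0, 1.0) for x in latent]

# Flattened stand-in for an encoded input video (hypothetical values).
video_latent = [0.5, -1.0, 2.0, 0.0]
start = partially_noise(video_latent, strength=0.7, rng=random.Random(0))
```

Starting the denoising loop from `start` instead of pure noise keeps part of the input signal, which is one plausible way motion from the source video could carry over into the stylized output.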

Abstract

We address the task of video style transfer with diffusion models, where the goal is to preserve the context of an input video while rendering it in a target style specified by a text prompt. A major challenge is the lack of paired video data for supervision. We propose PickStyle, a video-to-video style transfer framework that augments pretrained video diffusion backbones with style adapters and benefits from paired still image data with source-style correspondences for training. PickStyle inserts low-rank adapters into the self-attention layers of conditioning modules, enabling efficient specialization for motion-style transfer while maintaining strong alignment between video content and style. To bridge the gap between static image supervision and dynamic video, we construct synthetic training clips from paired images by applying shared augmentations that simulate camera motion, ensuring temporal priors are preserved. In addition, we introduce Context-Style Classifier-Free Guidance (CS-CFG), a novel factorization of classifier-free guidance into independent text (style) and video (context) directions. CS-CFG ensures that context is preserved in generated video while the style is effectively transferred. Experiments across benchmarks show that our approach achieves temporally coherent, style-faithful, and content-preserving video translations, outperforming existing baselines both qualitatively and quantitatively.
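
One way to picture a guidance rule with separate style and context directions is the combination below. This is a sketch under assumptions: the three-branch form, the weights `w_context` and `w_style`, and the function name `factorized_guidance` are illustrative, and the exact CS-CFG formula should be taken from the paper's equations.

```python
def factorized_guidance(pred_null, pred_context, pred_full, w_context, w_style):
    """Combine three denoiser outputs along separate context and style directions.

    pred_null:    output with null text and a null context (assumed branch)
    pred_context: output with null text and the real context video
    pred_full:    output with the style prompt and the real context video
    """
    return [
        n + w_context * (c - n) + w_style * (f - c)
        for n, c, f in zip(pred_null, pred_context, pred_full)
    ]

# Toy per-element predictions standing in for model outputs.
guided = factorized_guidance(
    pred_null=[0.0, 1.0],
    pred_context=[1.0, 1.0],
    pred_full=[1.0, 3.0],
    w_context=2.0,
    w_style=3.0,
)
```

With both weights set to 1 the expression reduces to `pred_full`. Raising `w_context` pushes the result further along the context direction and raising `w_style` further along the style direction, so the two can be tuned independently.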

Paper Structure

This paper contains 21 sections, 15 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: PickStyle addresses video-to-video style transfer by preserving motion and context while translating videos into diverse styles. Unlike prior methods that treat the task as artistic style transfer (color–texture statistics while ignoring geometric properties of the target style) and that often suffer from style degradation, visual inconsistency and temporal flicker, PickStyle produces coherent translations across nine styles.
  • Figure 2: Training and inference pipeline of PickStyle. In training (left), both the style image and the context image are transformed into video tokens and context tokens with synthetic camera motion using motion augmentation; video tokens are noised and denoised conditioned on context tokens by the DiT-based PickStyle model with context-style adapters. In inference (right), a context video and a style description are encoded and iteratively denoised under text, context, and null conditions, where the proposed CS-CFG applies spatiotemporal permutation to the null context to generate the final styled video.
  • Figure 3: Comparison on CSD Score and inference cost, per one second of generated video. Inference is evaluated on a single H100 GPU.
  • Figure 4: Qualitative comparison of PickStyle, Control-a-Video, Rerender, FRESCO, and FLATTEN in LEGO and anime styles.
  • Figure 5: Qualitative evaluation of PickStyle on a non-photorealistic example rendered in Unity3D.
  • ...and 7 more figures
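
The training side of Figure 2 builds clips from still image pairs using synthetic camera motion. A minimal sketch of that idea follows, assuming a simple horizontal pan implemented as a sliding crop. The function names, the pan-only motion, and the nested-list image representation are illustrative assumptions rather than the authors' augmentation.

```python
def sliding_crops(image, crop_h, crop_w, num_frames, step):
    """Turn one image (a list of rows) into a clip by panning a crop window right."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w + step * (num_frames - 1) > width:
        raise ValueError("crop path does not fit inside the image")
    return [
        [row[t * step : t * step + crop_w] for row in image[:crop_h]]
        for t in range(num_frames)
    ]

def paired_clips(context_image, style_image, **motion):
    """Apply the same synthetic pan to both images so the two clips stay aligned."""
    return sliding_crops(context_image, **motion), sliding_crops(style_image, **motion)

# Toy 4x6 "images": the style image is a pixel-wise transform of the context image.
context = [[10 * r + c for c in range(6)] for r in range(4)]
style = [[-v for v in row] for row in context]
ctx_clip, sty_clip = paired_clips(
    context, style, crop_h=3, crop_w=4, num_frames=3, step=1
)
```

Because both clips share the same motion parameters, every frame of `sty_clip` remains the stylized counterpart of the matching frame in `ctx_clip`, which is what lets image pairs stand in for paired video supervision.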