Inversion-Free Video Style Transfer with Trajectory Reset Attention Control and Content-Style Bridging

Jiang Lin, Zili Yi

TL;DR

This work tackles content leakage and style misalignment in video style transfer by introducing inversion-free Trajectory Reset Attention Control (TRAC) and Style Medium as an intermediary bridge between content and style. TRAC preserves content by injecting auxiliary path content into the main diffusion path and by resetting the latent trajectory to follow the ideal forward diffusion path, avoiding costly inversion techniques and reducing computation. Style Medium uses disentangled style encoding guided by Multimodal Large Language Models (MLLMs) to align style with content elements, mitigating leakage and improving stylistic fidelity when combined with TRAC within a tuning-free diffusion framework that also leverages IP-Adapter and ControlNet for structural guidance. The proposed framework demonstrates strong image and video stylization performance with improved content integrity, temporal coherence, and efficiency, making it suitable for scalable, real-time-like applications in video editing and content creation. The key theoretical and practical contributions include $Q_{main}(t) \rightarrow Q_{aux}(t)$ updates in self-attention, the forward-trajectory-based TRAC, and the Style Medium as a bridging representation for robust style transfer.
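
The query update mentioned above can be illustrated with a minimal NumPy sketch. It assumes the update means that the main path's self-attention queries are swapped for the auxiliary path's queries while the main path keeps its own keys and values; the summary does not state which layers or timesteps are affected. The helper name `attention_with_query_injection` and the toy shapes are hypothetical and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q, k, v):
    # Scaled dot-product attention for arrays of shape (tokens, dim).
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def attention_with_query_injection(q_main, k_main, v_main, q_aux, inject=True):
    # Hypothetical helper (assumption): when `inject` is True, the auxiliary
    # path's queries replace the main path's queries; keys and values stay
    # those of the main path.
    q = q_aux if inject else q_main
    return self_attention(q, k_main, v_main)

# Toy inputs: 4 tokens with 8-dimensional features per projection.
rng = np.random.default_rng(0)
tokens, dim = 4, 8
q_main, k_main, v_main, q_aux = (
    rng.standard_normal((tokens, dim)) for _ in range(4)
)

out_injected = attention_with_query_injection(q_main, k_main, v_main, q_aux)
out_plain = attention_with_query_injection(
    q_main, k_main, v_main, q_aux, inject=False
)
```

With injection enabled, the attention map is driven by the auxiliary (content) queries, which is one plausible reading of how content structure is carried into the stylized path.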

Abstract

Video style transfer aims to alter the style of a video while preserving its content. Previous methods often struggle with content leakage and style misalignment, particularly when using image-driven approaches that aim to transfer precise styles. In this work, we introduce Trajectory Reset Attention Control (TRAC), a novel method that allows for high-quality style transfer while preserving content integrity. TRAC operates by resetting the denoising trajectory and enforcing attention control, thus enhancing content consistency while significantly reducing computational cost compared with inversion-based methods. Additionally, a concept termed Style Medium is introduced to bridge the gap between content and style, enabling a more precise and harmonious transfer of stylistic elements. Building upon these concepts, we present a tuning-free framework that offers a stable, flexible, and efficient solution for both image and video style transfer. Experimental results demonstrate that our proposed framework accommodates a wide range of stylized outputs, from precise content preservation to the production of visually striking results with vibrant and expressive styles.
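
To make the idea of resetting the denoising trajectory concrete, here is a small sketch of the standard closed-form forward diffusion step, $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, which yields a noisy latent at any timestep directly from a clean latent without running an inversion. That TRAC resets latents in exactly this way is an assumption made for illustration; the linear beta schedule, the function names `make_alpha_bars` and `forward_noise`, and the toy latent shape are likewise hypothetical.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    # Cumulative product of (1 - beta_t) for a linear beta schedule
    # (assumed here; the paper's schedule is not given in this summary).
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bars, eps):
    # Closed-form forward diffusion: noisy latent at step t from clean x0.
    a = alpha_bars[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps

rng = np.random.default_rng(0)
alpha_bars = make_alpha_bars()

x0 = rng.standard_normal((4, 8, 8))   # stand-in for a clean content latent
eps = rng.standard_normal(x0.shape)   # one fixed noise sample reused across steps

# "Reset" (assumption): recompute the latent at step t from x0 and eps
# instead of recovering it through an inversion pass.
t = 500
x_t = forward_noise(x0, t, alpha_bars, eps)
```

Because `x_t` is computed in one step from `x0` and `eps`, no iterative inversion pass is needed, which is consistent with the computational savings the abstract describes.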

Paper Structure

This paper contains 14 sections, 2 equations, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Visualizations on Image style transfer.
  • Figure 2: Comparison on the effect of using Style Medium.
  • Figure 3: The video style transfer framework operates as follows: The source video is first processed by an MLLM to generate two descriptions, one long and one short. The long description guides the content, while the style-specific features from the style reference inform the generation of the style medium. This style medium then serves as the new style reference for the video style transfer process. Meanwhile, the short description aids in generating the final stylized video.
  • Figure 4: Illustration of the prediction deviation as the diffusion process advances.
  • Figure 5: Comparison of image style transfer results.
  • ...and 2 more figures