Inversion-Free Video Style Transfer with Trajectory Reset Attention Control and Content-Style Bridging
Jiang Lin, Zili Yi
TL;DR
This work tackles content leakage and style misalignment in video style transfer by introducing inversion-free Trajectory Reset Attention Control (TRAC) and Style Medium as an intermediary bridge between content and style. TRAC preserves content by injecting auxiliary path content into the main diffusion path and by resetting the latent trajectory to follow the ideal forward diffusion path, avoiding costly inversion techniques and reducing computation. Style Medium uses disentangled style encoding guided by Multimodal Large Language Models (MLLMs) to align style with content elements, mitigating leakage and improving stylistic fidelity when combined with TRAC within a tuning-free diffusion framework that also leverages IP-Adapter and ControlNet for structural guidance. The proposed framework demonstrates strong image and video stylization performance with improved content integrity, temporal coherence, and efficiency, making it suitable for scalable, near-real-time applications in video editing and content creation. The key theoretical and practical contributions include $Q_{main}(t) \rightarrow Q_{aux}(t)$ updates in self-attention, the forward-trajectory-based TRAC, and the Style Medium as a bridging representation for robust style transfer.
Abstract
Video style transfer aims to alter the style of a video while preserving its content. Previous methods often struggle with content leakage and style misalignment, particularly image-driven approaches that aim to transfer precise styles. In this work, we introduce Trajectory Reset Attention Control (TRAC), a novel method that allows for high-quality style transfer while preserving content integrity. TRAC operates by resetting the denoising trajectory and enforcing attention control, thus enhancing content consistency while significantly reducing computational cost relative to inversion-based methods. Additionally, a concept termed Style Medium is introduced to bridge the gap between content and style, enabling a more precise and harmonious transfer of stylistic elements. Building upon these concepts, we present a tuning-free framework that offers a stable, flexible, and efficient solution for both image and video style transfer. Experimental results demonstrate that our proposed framework accommodates a wide range of stylized outputs, from precise content preservation to visually striking results with vibrant and expressive styles.
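
The two ingredients attributed to TRAC, query injection in self-attention and a reset of the latent onto the forward diffusion trajectory, can be illustrated with a toy NumPy sketch. This is not the authors' implementation. It assumes that the update $Q_{main}(t) \rightarrow Q_{aux}(t)$ means the main path attends with the auxiliary path's queries, and that the "ideal forward diffusion path" is the standard closed-form forward process $z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. All function names (`attention_with_query_injection`, `reset_trajectory`, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v):
    """Scaled dot-product attention over (tokens, dim) arrays."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def attention_with_query_injection(q_main, k_main, v_main, q_aux):
    """Assumed form of the attention control: the main path keeps its own
    keys and values but attends with the auxiliary (content) queries.
    q_main is accepted only to make the substitution explicit."""
    return self_attention(q_aux, k_main, v_main)


def forward_diffuse(z0, alpha_bar_t, noise):
    """Closed-form forward diffusion: sqrt(a) * z0 + sqrt(1 - a) * noise."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * noise


def reset_trajectory(z0_content, alpha_bar_t, noise):
    """Assumed form of the trajectory reset: obtain the latent at step t
    directly from the clean content latent, with no inversion pass."""
    return forward_diffuse(z0_content, alpha_bar_t, noise)


# Toy usage with random tensors standing in for real latents and features.
rng = np.random.default_rng(0)
tokens, dim = 4, 8
q_main, k_main, v_main, q_aux = (rng.normal(size=(tokens, dim)) for _ in range(4))
out = attention_with_query_injection(q_main, k_main, v_main, q_aux)

z0 = rng.normal(size=(tokens, dim))
eps = rng.normal(size=(tokens, dim))
z_t = reset_trajectory(z0, alpha_bar_t=0.5, noise=eps)
```

Under these assumptions, the reset costs one scaled sum per timestep, whereas an inversion-based method needs additional network evaluations to recover the same latent, which is consistent with the efficiency claim above.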
