StructuReiser: A Structure-preserving Video Stylization Method

Radim Spetlik, David Futschik, Daniel Sykora

TL;DR

StructuReiser is a novel video-to-video translation method that transforms input videos into stylized sequences using a set of user-provided keyframes, enabling interactive applications and expanding the possibilities for creative expression and video manipulation.

Abstract

We introduce StructuReiser, a novel video-to-video translation method that transforms input videos into stylized sequences using a set of user-provided keyframes. Unlike existing approaches, StructuReiser maintains strict adherence to the structural elements of the target video, preserving the original identity while seamlessly applying the desired stylistic transformations. This enables a level of control and consistency that was previously unattainable with traditional text-driven or keyframe-based methods. Furthermore, StructuReiser supports real-time inference and custom keyframe editing, making it ideal for interactive applications and expanding the possibilities for creative expression and video manipulation.

Paper Structure (22 sections, 5 equations, 25 figures, 1 table)

Figures (25)

  • Figure 1: StructuReiser transfers the style from a single stylized keyframe (a) to the entire video sequence (b), generating stylized frames (c) that are both stylistically consistent and structurally faithful. The keyframe (a) was created using the text-guided video-to-video diffusion model by Ceylan et al. [ceylan_pix2video_2023]. However, when applied directly to other frames in the sequence, this model often introduces significant structural inconsistencies (d). The state-of-the-art keyframe-based video stylization method of Futschik et al. [futschik_stalp_2021] faces similar issues (e). In contrast, our approach (c) maintains the structural integrity of the target video sequence while ensuring coherent stylization throughout.
  • Figure 2: An overview of our approach. Given images from a source domain $\mathbf{y}_i \in \mathcal{Y}$, we optimize the operator $f$ to produce images $\hat{\mathbf{y}}_i$ with an appearance similar to that of images from a target domain $\hat{\mathbf{x}}_i$. The key loss $\mathcal{L}_{\text{key}}$ (Eq. \ref{eq:loss_key}) encourages reconstruction of the keyframes $\hat{\mathbf{x}}_i$; the style loss $\mathcal{L}_{\text{style}}$ (Eq. \ref{eq:loss_vgg}) ensures style consistency between frames and keyframes using Gram correlation matrices $g$ of extracted VGG network responses $v$ [gatys_image_2016]; and the structure loss $\mathcal{L}_{\text{structure}}$ (Eq. \ref{eq:loss_sds}) enforces fidelity to structural elements present in the input video frames $\mathbf{y}_i$. The structure loss requires a pre-trained ControlNet [zhang_adding_2023] model consisting of a diffusion model $d$ initialized by adding random Gaussian noise $\boldsymbol{\epsilon}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ to the synthesized image $\hat{\mathbf{y}}_i$, a time step $t$, and a function $c$ that transforms the input image $\mathbf{y}_i$ into a condition $\mathbf{c}$ (in this case, the Canny edge detector [canny_computational_1986]).
  • Figure 3: Results of our approach in comparison with the state of the art in diffusion-based video stylization. The target video sequence (see a representative target frame $\mathbf{y}$) has been stylized using diffusion-based approaches (top row): (a) Ceylan et al. [ceylan_pix2video_2023], (b) Yang et al. [yang_rerender_2023], (c) Chu et al. [chu_medm_2024], and (d) Geyer et al. [geyer_tokenflow_2024]. One frame from each of those stylized sequences was used as a keyframe (see small insets). The style of this keyframe has been propagated to the rest of the target sequence $\mathbf{y} \in \mathcal{Y}$ using our approach (bottom row). Note how our approach better preserves the structural details seen in the target frame. Also, see our supplementary video to compare consistency across the entire sequence. Diffusion-based approaches tend to suffer from notable structural flicker, whereas our approach keeps the structure consistent, yielding considerably more stable results.
  • Figure 4: Results of our approach in comparison with the state of the art in diffusion-based video stylization (cont.). See Fig. 3 for a detailed explanation.
  • Figure 5: Results of our approach in comparison with the state of the art in diffusion-based video stylization (cont.). See Fig. 3 for a detailed explanation.
  • ...and 20 more figures
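
The Figure 2 caption describes a style loss that compares Gram correlation matrices of VGG responses between frames and keyframes. The sketch below illustrates that general idea only, in plain Python. The helper names `gram_matrix` and `style_loss`, the normalization by the number of spatial positions, and the mean-squared difference are assumptions for illustration, not the paper's exact formulation; feature extraction by VGG is also omitted, so the inputs are hypothetical channel-by-position response lists.

```python
from typing import List

Matrix = List[List[float]]


def gram_matrix(features: Matrix) -> Matrix:
    """Channel-by-channel inner products of feature responses.

    `features` holds C rows (channels), each with N flattened spatial
    responses. Dividing by N is an assumed normalization.
    """
    n = len(features[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / n for row_j in features]
        for row_i in features
    ]


def style_loss(frame_features: Matrix, keyframe_features: Matrix) -> float:
    """Mean squared difference between the two Gram matrices (assumed form)."""
    g_frame = gram_matrix(frame_features)
    g_key = gram_matrix(keyframe_features)
    c = len(g_frame)
    total = sum(
        (g_frame[i][j] - g_key[i][j]) ** 2 for i in range(c) for j in range(c)
    )
    return total / (c * c)


# Hypothetical 2-channel, 2-position responses.
keyframe = [[1.0, 0.0], [0.0, 1.0]]
# Same responses with the two spatial positions swapped.
swapped = [[0.0, 1.0], [1.0, 0.0]]
# Different channel statistics.
different = [[2.0, 0.0], [0.0, 1.0]]

print(style_loss(swapped, keyframe))    # spatial swap leaves the Gram matrix unchanged
print(style_loss(different, keyframe))  # changed statistics give a positive loss
```

Because the Gram matrix sums over spatial positions, swapping positions does not change it. This suggests why a style term of this kind cannot constrain layout by itself, which is consistent with the caption pairing it with a separate structure loss.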