WAIT: Feature Warping for Animation to Illustration video Translation using GANs

Samet Hicsonmez, Nermin Samet, Fidan Samet, Oguz Bakir, Emre Akbas, Pinar Duygulu

TL;DR

This work tackles the problem of translating animation videos into illustration styles when only an unordered set of target images is available. The authors propose WAIT, a GAN-based generator with feature warping layers that enforces temporal coherence without relying on optical flow or temporal predictors. It computes offset features from the difference between the features of the input frame and an auxiliary frame, then uses them in a warping stage together with the auxiliary features. WAIT achieves state-of-the-art or competitive performance on three datasets (AS, BP, and Flowers) as measured by FID, KID, FWE, and MSE, while being simpler and more efficient than baselines such as CycleGAN, ReCycleGAN, and I2V-GAN. The approach introduces an Offset Network and parallel Warping Layers to capture multi-scale temporal context, and ablations identify the best-performing depth, layer count, and time-gap parameters. Overall, WAIT advances video-to-video translation by enabling high-quality, temporally coherent stylization from unordered target images with reduced architectural complexity and computation.

Abstract

In this paper, we explore a new domain for video-to-video translation. Motivated by the availability of animation movies that are adapted from illustrated books for children, we aim to stylize these videos with the style of the original illustrations. Current state-of-the-art video-to-video translation models rely on having a video sequence or a single style image to stylize an input video. We introduce a new problem for video stylization where an unordered set of images is used. This is a challenging task for two reasons: i) we do not have the advantage of temporal consistency as in video sequences; ii) it is more difficult to obtain consistent styles for video frames from a set of unordered images than from a single image. Most video-to-video translation methods are built on an image-to-image translation model and integrate additional networks, such as optical flow or temporal predictors, to capture temporal relations. These additional networks complicate model training and inference and slow down the process. To ensure temporal coherency in video-to-video style transfer, we propose a new generator network with feature warping layers that overcomes the limitations of the previous methods. We show the effectiveness of our method on three datasets both qualitatively and quantitatively. Code and pretrained models are available at https://github.com/giddyyupp/wait.

Paper Structure (21 sections, 8 equations, 5 figures, 6 tables)

This paper contains 21 sections, 8 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Left: Sequences from animation movies adapted from stories illustrated by Axel Scheffler (AS) and Beatrix Potter (BP), where a temporal relation exists. Right: Pages taken from various books by the same illustrators. Note that even when the same characters are present in both animations and illustrations, their resemblance is limited. Colors and styles also differ between movies and books. Moreover, in the BP dataset, the corners of some illustrations are left blank.
  • Figure 2: High-level comparison of the baseline models and the proposed method, WAIT. (a) CycleGAN has two generator networks: $G_X$ (red) and $G_Y$ (blue). These networks successfully capture the target styles but fail to ensure temporal coherency. (b) ReCycleGAN adds two temporal predictor networks, $P_X$ and $P_Y$, on top of CycleGAN to supply the missing temporal relations. $P_X$ and $P_Y$ take the two preceding frames, $X_{t}$ and $X_{t-1}$, and predict the subsequent frame $X_{t+1}$. (c) OpticalFlowWarping uses optical flow to capture temporal relations between generated frames. Rectangular boxes correspond to a pre-trained optical flow prediction network. The predicted flow is used to warp a previously generated frame to obtain the stylized current frame. (d) Our method, WAIT, has only two networks to capture both the target style and temporal coherency. It has no external network components, and its design is as simple as that of CycleGAN. Temporal information is incorporated through feature warping layers inside the generator network. For all models, dashed lines show the loss calculations specific to the model.
  • Figure 3: Detailed description of the generator network of WAIT. Our model takes two images, the input frame $I_t$ and the auxiliary frame $I_{t+\delta}$, and forwards them through the Backbone CNN to extract the feature maps $F_t$ and $F_{t+\delta}$, respectively. We then compute the difference of these two feature maps, $F_{diff}$, and forward it to the Offset Network to obtain the offset features $F_{offset}$. As a final stage, we warp the offset features $F_{offset}$ with the auxiliary features $F_{t+\delta}$ to create the final translated frame. The warping stage contains 5 parallel layers to capture features at different resolutions.
  • Figure 4: Visual results on the AS dataset. The top row displays a short sequence and the bottom row a long sequence from our test set. The leftmost column contains the input videos, and the following columns show the results of the baseline methods and WAIT. For both sequences, only WAIT captures the target style correctly. In terms of visual quality and temporal coherency, red and black patches on the mouth of the horse are visible for every baseline on the first sequence. On the bottom sequence, the advantage of WAIT is even clearer. The CycleGAN, OpticalFlowWarp, and ReCycleGAN results clearly fail to capture temporal coherency: the colors of the background grass and sky change between frames, and white or black patches and holes move across the scene. In contrast, the WAIT results show both high frame quality and temporal coherency between frames. Zoom in for details. Each cell is a short video clip; for the best viewing experience, consider using a compatible PDF reader (e.g. Adobe Reader).
  • Figure 5: Visual results on the BP dataset. The top row displays a short sequence and the bottom row a long sequence from our test set. The leftmost column contains the input videos, and the following columns show the results of the baseline methods and WAIT, respectively. As with the AS results, only WAIT captures the target style correctly for both sequences. In terms of visual quality and temporal coherency, the colors of the leaves and the background change abruptly in the CycleGAN, OpticalFlowWarp, and ReCycleGANv2 results on the first sequence. In the ReCycleGAN results, there is a white patch on the apple in some frames. On the bottom sequence, which is a 5-second clip, the advantage of WAIT is easy to see: the style is correctly captured and temporal coherency is ensured. Among the baselines, CycleGAN surprisingly captures temporal coherency much better than the others. The only defect in the CycleGAN results is the pinkish colorization on the wall and the head of the rabbit. The remaining baselines clearly fail to capture temporal coherency: the general color palette of the scene changes dramatically between frames. Also, there are white patches and holes on the pumpkin in the ReCycleGANv2 results. Zoom in for details. Each cell is a short video clip; for the best viewing experience, consider using a compatible PDF reader (e.g. Adobe Reader).
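
The data flow described in the Figure 3 caption (backbone features, feature difference, offset features, warping with the auxiliary features) can be illustrated with a toy sketch. This is not the authors' implementation, and everything in it is an assumption made for illustration: the backbone is an identity function, the Offset Network is replaced by a hand-written rule (`toy_offset_network`, a hypothetical name), the sign of the difference is chosen arbitrarily, warping is done by plain bilinear sampling on a single-channel map, and only one warping branch is shown instead of the five parallel layers. The sketch also stops at the warped features rather than decoding a translated frame.

```python
import math


def backbone(frame):
    """Hypothetical stand-in for the Backbone CNN: the frame is its own feature map."""
    return [list(row) for row in frame]


def feature_difference(f_t, f_aux):
    """F_diff = F_t - F_{t+delta}, element-wise (the sign convention is an assumption)."""
    return [[a - b for a, b in zip(row_t, row_aux)]
            for row_t, row_aux in zip(f_t, f_aux)]


def toy_offset_network(f_diff, shift=(0.0, 1.0)):
    """Hand-written stand-in for the learned Offset Network (assumption).

    Emits the fixed (dy, dx) `shift` wherever the two feature maps disagree
    and a zero offset elsewhere. A real network would learn this mapping.
    """
    return [[shift if abs(v) > 1e-9 else (0.0, 0.0) for v in row]
            for row in f_diff]


def bilinear_sample(feat, y, x):
    """Read a 2D map at fractional (y, x), clamping coordinates to the border."""
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp(f_aux, offsets):
    """Sample the auxiliary features at each position displaced by its offset."""
    return [[bilinear_sample(f_aux, i + dy, j + dx)
             for j, (dy, dx) in enumerate(row)]
            for i, row in enumerate(offsets)]


def generator_features(frame_t, frame_aux):
    """Backbone -> difference -> offsets -> warp, with a single warping branch."""
    f_t, f_aux = backbone(frame_t), backbone(frame_aux)
    f_offset = toy_offset_network(feature_difference(f_t, f_aux))
    return warp(f_aux, f_offset)


# Toy frames: a bright pixel moves one column to the right between t and t + delta.
frame_t = [[0.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
frame_aux = [[0.0, 0.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]

warped = generator_features(frame_t, frame_aux)
```

In this toy setting, the warped auxiliary features line up with the features of the input frame, which is the intuition behind using an auxiliary frame for temporal context. The hand-set offset only works because the motion is known in advance; in a trained model the offsets would come from a learned network.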