WAIT: Feature Warping for Animation to Illustration Video Translation using GANs
Samet Hicsonmez, Nermin Samet, Fidan Samet, Oguz Bakir, Emre Akbas, Pinar Duygulu
TL;DR
This work tackles the problem of translating animation videos into illustration styles when only an unordered set of target images is available. The authors propose WAIT, a GAN-based generator with feature warping layers that enforces temporal coherence without relying on optical flow or temporal predictors, by warping offset features learned from a reference and an auxiliary frame. WAIT achieves state-of-the-art or competitive performance on multiple datasets (AS, BP, and Flowers) as measured by FID, KID, FWE, and MSE, while offering superior efficiency and simplicity compared to baselines like CycleGAN, ReCycleGAN, and I2V-GAN. The approach introduces an Offset Network and parallel Warping Layers to capture multi-scale temporal context, and ablations identify optimal depth, layer count, and time-gap parameters. Overall, WAIT advances video-to-video translation by enabling high-quality, temporally coherent stylization from unordered target images with reduced architectural complexity and computation.
Abstract
In this paper, we explore a new domain for video-to-video translation. Motivated by the availability of animation movies that are adapted from illustrated books for children, we aim to stylize these videos with the style of the original illustrations. Current state-of-the-art video-to-video translation models rely on having a video sequence or a single style image to stylize an input video. We introduce a new problem for video stylization in which an unordered set of images is used as the style target. This is a challenging task for two reasons: i) we do not have the advantage of temporal consistency as in video sequences; ii) it is more difficult to obtain consistent styles for video frames from a set of unordered images than from a single image. Most video-to-video translation methods are built on an image-to-image translation model and integrate additional networks, such as optical flow estimators or temporal predictors, to capture temporal relations. These additional networks complicate model training and inference and slow down both. To ensure temporal coherence in video-to-video style transfer, we propose a new generator network with feature warping layers, which overcomes the limitations of the previous methods. We show the effectiveness of our method on three datasets both qualitatively and quantitatively. Code and pretrained models are available at https://github.com/giddyyupp/wait.
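As a rough illustration of what "feature warping" can mean in general, the sketch below resamples a single-channel feature map at positions shifted by per-pixel offsets. It is not the authors' implementation: bilinear resampling is swapped in here as a generic warping operator, and the names `bilinear_sample` and `warp_feature_map`, the plain nested-list layout, and the border clamping are all assumptions made for this example. In the paper the offsets are learned by a network; here they are supplied by hand.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample a 2D feature map at a fractional position (y, x).

    Positions outside the grid are clamped to the border (an assumption
    of this sketch, not something stated in the paper).
    """
    h, w = len(feat), len(feat[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    # Interpolate along x on the two neighbouring rows, then along y.
    top = (1 - wx) * feat[y0][x0] + wx * feat[y0][x1]
    bottom = (1 - wx) * feat[y1][x0] + wx * feat[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp_feature_map(feat, offsets):
    """Warp `feat` so that output[y][x] = feat sampled at (y + dy, x + dx).

    `offsets[y][x]` is a hypothetical (dy, dx) pair for each location.
    """
    h, w = len(feat), len(feat[0])
    return [
        [
            bilinear_sample(feat, y + offsets[y][x][0], x + offsets[y][x][1])
            for x in range(w)
        ]
        for y in range(h)
    ]


if __name__ == "__main__":
    feat = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
    # Every location looks one pixel to its right.
    shift_right = [[(0.0, 1.0)] * 3 for _ in range(3)]
    for row in warp_feature_map(feat, shift_right):
        print(row)
```

With all-zero offsets the warp is the identity, so offsets of this kind let features from one frame be aligned to another without estimating optical flow explicitly.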
