RAVEN: Rethinking Adversarial Video Generation with Efficient Tri-plane Networks

Partha Ghosh, Soubhik Sanyal, Cordelia Schmid, Bernhard Schölkopf

TL;DR

This work presents a novel unconditional video generative model that incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a single latent code to model an entire video clip.

Abstract

We present a novel unconditional video generative model designed to address long-term spatial and temporal dependencies, with attention to computational and dataset efficiency. To capture long spatio-temporal dependencies, our approach incorporates a hybrid explicit-implicit tri-plane representation inspired by 3D-aware generative frameworks developed for three-dimensional object representation and employs a single latent code to model an entire video clip. Individual video frames are then synthesized from an intermediate tri-plane representation, which itself is derived from the primary latent code. This novel strategy more than halves the computational complexity measured in FLOPs compared to the most efficient state-of-the-art methods. Consequently, our approach facilitates the efficient and temporally coherent generation of videos. Moreover, our joint frame modeling approach, in contrast to autoregressive methods, mitigates the generation of visual artifacts. We further enhance the model's capabilities by integrating an optical flow-based module within our Generative Adversarial Network (GAN) based generator architecture, thereby compensating for the constraints imposed by a smaller generator size. As a result, our model synthesizes high-fidelity video clips at a resolution of $256\times256$ pixels, with durations extending to more than $5$ seconds at a frame rate of 30 fps. The efficacy and versatility of our approach are empirically validated through qualitative and quantitative assessments across three different datasets comprising both synthetic and real video clips. We will make our training and inference code public.
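To make the tri-plane idea concrete, here is a minimal NumPy sketch of how a feature for a space-time point $(x, y, t)$ could be read from three 2D feature planes. It is an illustration, not the authors' implementation: the abstract does not specify the plane layout or how the plane features are combined. The plane names (`xy`, `xt`, `yt`), the sizes, the summation of the three samples (borrowed from common tri-plane 3D generators), and the helper names `bilinear_sample` and `triplane_feature` are all assumptions.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample a feature plane of shape (H, W, C).

    u is the normalized column coordinate and v the normalized row
    coordinate, both in [0, 1].
    """
    h, w, _ = plane.shape
    x = u * (w - 1)
    y = v * (h - 1)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * plane[y0, x0] + fx * plane[y0, x1]
    bottom = (1 - fx) * plane[y1, x0] + fx * plane[y1, x1]
    return (1 - fy) * top + fy * bottom


def triplane_feature(planes, x, y, t):
    """Combine features from three axis-aligned planes for the point (x, y, t).

    Assumption: features are summed, as in typical tri-plane 3D generators.
    """
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xt = bilinear_sample(planes["xt"], x, t)
    f_yt = bilinear_sample(planes["yt"], y, t)
    return f_xy + f_xt + f_yt


# Hypothetical sizes: 32x32 planes with 8 feature channels each.
size, channels = 32, 8
rng = np.random.default_rng(0)
planes = {name: rng.standard_normal((size, size, channels))
          for name in ("xy", "xt", "yt")}

feat = triplane_feature(planes, x=0.25, y=0.5, t=0.75)

# Storage comparison: three planes versus a dense space-time feature volume.
triplane_values = 3 * size * size * channels
dense_values = size ** 3 * channels
```

The storage comparison at the end shows why such a factorization is attractive: the three planes grow quadratically with the grid size, while a dense space-time volume grows cubically.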

Paper Structure (17 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Sample frames generated by our method, trained on three different datasets: Talking Faces, Webvid10M-Flowers, and Fashion videos. We select six frames, each about 0.8 seconds after the previous one, to emphasize the motion between frames in the generated video. We then swap the red color channel within each pair of consecutive selected frames: the red channel of the 0th frame is swapped with that of the 1st, the 2nd with the 3rd, and the 4th with the 5th. This results in three overlapping frames with visible motion across the color channels. The videos are available in the supplementary material.
  • Figure 2: Our video generation model comprises the following parts: a StyleGAN-T-based backbone, a tri-plane representation of motion, a flow decoder, a forward warping process, a super-resolution module, an image discriminator, and a video discriminator. The tri-plane representation allows the architecture to represent video data efficiently, while the flow fields and warping mechanism represent motion explicitly. Finally, we discriminate the generated video at low resolution and randomly chosen frames at high resolution, which allows efficient generation of long, high-resolution videos. Cyan blocks represent trainable modules and gray blocks represent fixed operations.
  • Figure 3: Qualitative results: In panels (a) and (b) we show results generated by our method in comparison to MoCoGAN (Tulyakov et al., 2018), StyleGAN-V (Skorokhodov et al., 2022), and Stable Video Diffusion (SVD) (Blattmann et al., 2023). For visualization, we show only $6$ equally spaced frames from a video clip of length $5$ seconds for all models except SVD. Since SVD generates only 25 frames, we simply visualize 6 equally spaced frames (i.e., we skip 4 frames between every two shown frames for SVD). We provide several generated videos from each model in the supplementary material.
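
The red-channel swap described in the Figure 1 caption can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: frames are taken to be stored as an array of shape `(N, H, W, 3)` in RGB order with an even number of frames, and the function name `swap_red_channels` is hypothetical. The caption does not say which frame of each swapped pair is displayed, so the sketch simply returns all swapped frames.

```python
import numpy as np


def swap_red_channels(frames):
    """Swap the red channel between frames (0, 1), (2, 3), (4, 5), ...

    frames: array of shape (N, H, W, 3) in RGB order, with N even.
    Returns a new array; the input is left unchanged.
    """
    frames = np.asarray(frames)
    if frames.ndim != 4 or frames.shape[-1] != 3 or frames.shape[0] % 2 != 0:
        raise ValueError("expected an even number of RGB frames, shape (N, H, W, 3)")
    out = frames.copy()
    out[0::2, ..., 0] = frames[1::2, ..., 0]  # even frames take red from their partner
    out[1::2, ..., 0] = frames[0::2, ..., 0]  # odd frames take red from their partner
    return out


# Toy input: six 4x4 frames, where frame i is filled with the constant value i.
frames = np.stack([np.full((4, 4, 3), i, dtype=np.uint8) for i in range(6)])
swapped = swap_red_channels(frames)

# One possible choice of the three composites to display: one frame per pair.
composites = swapped[0::2]
```

In a static scene the swapped frames look identical to the originals; wherever something moves between the two frames of a pair, the mismatched red channel shows up as a colored fringe, which is what makes motion visible in a still image.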