COMUNI: Decomposing Common and Unique Video Signals for Diffusion-based Video Generation

Mingzhen Sun, Weining Wang, Xinxin Zhu, Jing Liu

TL;DR

COMUNI is a novel diffusion-based framework that decomposes the COMmon and UNIque video signals to enable efficient video generation. It separates the decomposition of video signals from the task of video generation, which reduces the computational complexity of the generative models.

Abstract

Since videos record objects moving coherently, adjacent video frames have commonness (similar object appearances) and uniqueness (slightly changed postures). To prevent redundant modeling of common video signals, we propose a novel diffusion-based framework, named COMUNI, which decomposes the COMmon and UNIque video signals to enable efficient video generation. Our approach separates the decomposition of video signals from the task of video generation, thus reducing the computational complexity of generative models. In particular, we introduce CU-VAE to decompose video signals and encode them into latent features. To train CU-VAE in a self-supervised manner, we employ a cascading merge module to reconstitute video signals and a time-agnostic video decoder to reconstruct video frames. We then propose CU-LDM to model the latent features for video generation; it adopts two dedicated diffusion streams to model the common and unique latent features simultaneously. We further use additional joint modules for cross-modeling of the common and unique latent features, and a novel position embedding method to ensure the content consistency and motion coherence of generated videos. The position embedding method incorporates spatial and temporal absolute position information into the joint modules. Extensive experiments demonstrate the necessity of decomposing common and unique video signals for video generation, as well as the effectiveness and efficiency of our proposed method.
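
The decompose-then-recompose idea in the abstract can be illustrated with a toy sketch. This is not CU-VAE: it assumes, purely for illustration, that the common signal is the per-pixel temporal mean of a clip and that each unique signal is a frame's residual from that mean, whereas the paper learns the split with encoders. The names `decompose` and `recompose` and the tiny two-pixel "videos" are hypothetical.

```python
from typing import List, Tuple

Frame = List[float]   # a flattened toy "frame"
Video = List[Frame]   # a clip is a list of frames


def decompose(video: Video) -> Tuple[Frame, Video]:
    """Split a clip into one common signal and one unique signal per frame.

    ASSUMPTION (illustrative only): common = per-pixel mean over time,
    unique = residual of each frame. CU-VAE learns this split instead.
    """
    num_frames = len(video)
    common = [sum(pixels) / num_frames for pixels in zip(*video)]
    unique = [[p - c for p, c in zip(frame, common)] for frame in video]
    return common, unique


def recompose(common: Frame, unique: Video) -> Video:
    """Rebuild frames by adding the shared signal to each unique signal."""
    return [[c + u for c, u in zip(common, frame)] for frame in unique]


video_a: Video = [[1.0, 2.0], [3.0, 2.0]]
video_b: Video = [[10.0, 10.0], [10.0, 14.0]]

common_a, unique_a = decompose(video_a)
common_b, unique_b = decompose(video_b)

# Reconstruction from a clip's own signals returns the clip.
reconstructed_a = recompose(common_a, unique_a)

# Swapping: appearance of clip A combined with the motion of clip B.
swapped = recompose(common_a, unique_b)
```

Under this toy assumption the common signal is stored once per clip rather than once per frame, which is the redundancy the abstract aims to avoid modeling repeatedly.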

Paper Structure (31 sections, 14 equations, 9 figures, 7 tables)

Figures (9)

  • Figure 1: Visualization of the decomposed common and unique video signals. The first column displays the real videos. The second column depicts the decomposed common video signal for each video, e.g., human characteristics and video backgrounds. The third column depicts the decomposed unique video signals, e.g., changes in human expressions or body positions, which are in one-to-one correspondence with the real video frames. The last column shows videos decoded from swapped common and unique video signals of two neighboring videos, demonstrating that our method can successfully decompose common and unique video signals and flexibly recompose them when decoding videos.
  • Figure 2: The overall architecture of the proposed two-stage framework COMUNI. During the first stage, CU-VAE decomposes common and unique video signals by extracting the corresponding information and encoding it into latent features using two specific encoders: the commonness and uniqueness encoders. The merge module then recomposes these features in a cascading manner. Based on each fused feature, we adopt a time-agnostic video decoder to reconstruct the corresponding video frame. In the second stage, CU-LDM employs two diffusion streams to model the common and unique latent features simultaneously. Multiple joint modules are inserted to facilitate cross-modeling of the different latent features.
  • Figure 3: Qualitative comparison with other models for video generation on FaceForensics $256^2$ and UCF-101 $128^2$.
  • Figure 4: Visualization of the temporal attention map in CU-LDM for conditional generation of the subsequent 8 unique features based on 8 synthesized unique features. We employ $t_1$ to denote the row temporal index and $t_2$ to denote the column temporal index of the map.
  • Figure 5: Visualization of a synthesized long video in distinct sampling steps. Video frames in the yellow rectangle are unconditionally generated as an initial video clip, and video frames in the green rectangle are conditionally produced using the iterative generation method.
  • ...and 4 more figures
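
The iterative generation described in the captions of Figures 4 and 5 (an unconditionally generated initial clip, followed by clips of 8 unique features conditioned on the previous 8) can be sketched as a simple loop. This is a minimal sketch under stated assumptions: `sample_clip` stands in for a CU-LDM sampling call whose real interface is not given here, and `stub_sampler` is a hypothetical deterministic placeholder that returns integers instead of latent features.

```python
from typing import Callable, List, Optional

# Hypothetical sampler signature: (context or None, number of features) -> features.
Sampler = Callable[[Optional[List[int]], int], List[int]]


def generate_unique_features(
    sample_clip: Sampler, total_frames: int, clip_len: int = 8
) -> List[int]:
    """Extend a sequence clip by clip, conditioning on the latest clip."""
    # Initial clip: generated without any context (unconditional).
    features = list(sample_clip(None, clip_len))
    # Subsequent clips: conditioned on the most recent `clip_len` features.
    while len(features) < total_frames:
        context = features[-clip_len:]
        features.extend(sample_clip(context, clip_len))
    return features[:total_frames]


def stub_sampler(context: Optional[List[int]], num: int) -> List[int]:
    """PLACEHOLDER for a diffusion sampler: continues counting from the context."""
    start = 0 if context is None else context[-1] + 1
    return [start + i for i in range(num)]


long_sequence = generate_unique_features(stub_sampler, total_frames=20)
```

With the placeholder sampler, each conditional call continues from where the previous clip ended, which mirrors how the conditionally generated frames follow the initial clip in Figure 5.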