MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion

Onkar Susladkar, Jishu Sen Gupta, Chirag Sehgal, Sparsh Mittal, Rekha Singhal

TL;DR

MotionAura tackles the challenge of producing high-quality, temporally coherent videos by combining a 3D-MBQ-VAE, which encodes video into discrete tokens, with an FFT-based spectral transformer denoiser that performs non-autoregressive discrete diffusion. It adds a sketch-guided video inpainting pathway that uses LoRA for parameter-efficient fine-tuning, and it reports SOTA performance on text-conditioned video generation and sketch-guided inpainting across multiple benchmarks. Key innovations include full-frame masking during VAE pretraining, discrete diffusion in latent space, Fourier-domain attention with RoPE, and conditioning on rich captions and sketches. The reported practical gains are higher generation quality, longer videos, and efficient conditioning; the authors state that they will release code, datasets, and models as open source. Overall, MotionAura advances spatiotemporal video modeling by unifying discrete latent representations, frequency-domain denoising, and user-guided content manipulation.

Abstract

The spatiotemporal complexity of video data presents significant challenges in tasks such as compression, generation, and inpainting. We present four key contributions to address the challenges of spatiotemporal video processing. First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE), which combines Variational Autoencoders (VAEs) with masked token modeling to enhance spatiotemporal video compression. The model achieves superior temporal consistency and state-of-the-art (SOTA) reconstruction quality by employing a novel training strategy with full-frame masking. Second, we present MotionAura, a text-to-video generation framework that utilizes vector-quantized diffusion models to discretize the latent space and capture complex motion dynamics, producing temporally coherent videos aligned with text prompts. Third, we propose a spectral transformer-based denoising network that processes video data in the frequency domain using the Fourier Transform. This method effectively captures global context and long-range dependencies for high-quality video generation and denoising. Lastly, we introduce a downstream task of Sketch-Guided Video Inpainting. This task leverages Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Our models achieve SOTA performance on a range of benchmarks. Our work offers robust frameworks for spatiotemporal modeling and user-driven video content manipulation. We will release the code, datasets, and models as open source.
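
To make the "parameter-efficient" claim for LoRA concrete, the following is a minimal sketch of the standard LoRA update, W' = W + (alpha / r) * B A, in plain Python. It is not the authors' implementation: the function names, the layer size (1024 x 1024), the rank (8), and the scale are all illustrative assumptions that do not come from the paper.

```python
# Minimal LoRA sketch (illustrative only; sizes and names are assumptions).

def lora_param_counts(d_out, d_in, rank):
    """Return (full fine-tuning params, LoRA trainable params) for one weight matrix."""
    full = d_out * d_in            # every entry of W is trained
    lora = rank * (d_out + d_in)   # only B (d_out x rank) and A (rank x d_in) are trained
    return full, lora

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_weight(w, b, a, alpha, rank):
    """Effective weight W + (alpha / rank) * B @ A; W itself stays frozen."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]

# Hypothetical 1024 x 1024 projection adapted with rank 8.
full, lora = lora_param_counts(1024, 1024, 8)
print(full, lora)  # 1048576 16384

# With B initialised to zeros (the usual LoRA choice), the adapted layer
# starts out identical to the pre-trained one.
w = [[1.0, 2.0], [3.0, 4.0]]
b_zero = [[0.0], [0.0]]   # d_out x rank, rank = 1
a_init = [[0.5, -0.5]]    # rank x d_in
w_eff = lora_weight(w, b_zero, a_init, alpha=1.0, rank=1)
```

Under these assumed sizes, the adapter trains 16,384 parameters in place of 1,048,576, which is 1/64 of the full matrix.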

Paper Structure

This paper contains 44 sections, 5 equations, 30 figures, and 11 tables.

Figures (30)

  • Figure 1: We introduce MotionAura, a novel Text-to-Video generation model that predicts discrete tokens obtained from our large-scale pre-trained 3D VAE. The displayed frames represent videos generated by our model when provided with the captions shown below each frame. The generated videos shown above, along with other samples, are hosted at https://researchgroup12.github.io/Abstract_Diagram.html.
  • Figure 2: Our proposed pre-training method for the 3D-MBQ-VAE architecture.
  • Figure 3: Discrete diffusion pretraining of the spectral transformer involves processing tokenized video frame representations from the 3D-MBQ-VAE encoder. These representations are subjected to random masking based on a predefined probability distribution. The resulting corrupted tokens are then denoised through a series of $N$ Spectral Transformers. Contextual information from text representations generated by the T5-XXL-Encoder aids in this process. The denoised tokens are reconstructed using the 3D decoder.
  • Figure 4: Architecture of the spectral transformer.
  • Figure 5: Sketch-guided video inpainting process. The network inputs masked video latents, fully diffused unmasked latent, sketch conditioning, and text conditioning. It predicts the denoised latents using LoRA infused in our pre-trained denoiser $\epsilon_{\theta}$.
  • ...and 25 more figures
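
The corruption step described in the Figure 3 caption (random masking of discrete video tokens before denoising) can be sketched as follows. This is a simplified illustration, not the paper's procedure: it assumes an independent per-token mask with one fixed probability, whereas the caption only says masking follows "a predefined probability distribution". `MASK_ID`, `corrupt_tokens`, the codebook size of 1024, and the sequence length are hypothetical.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token

def corrupt_tokens(tokens, mask_prob, rng):
    """Replace each token with MASK_ID independently with probability mask_prob.

    Returns the corrupted sequence and a boolean list marking masked positions,
    which a denoiser would be trained to fill back in.
    """
    corrupted, mask = [], []
    for tok in tokens:
        hit = rng.random() < mask_prob
        corrupted.append(MASK_ID if hit else tok)
        mask.append(hit)
    return corrupted, mask

rng = random.Random(0)                               # seeded for repeatability
tokens = [rng.randrange(1024) for _ in range(16)]    # toy codebook indices
corrupted, mask = corrupt_tokens(tokens, mask_prob=0.5, rng=rng)

# Unmasked positions keep their original codebook index.
kept = [c for c, m in zip(corrupted, mask) if not m]
```

A denoiser trained on such pairs predicts the original index at every masked position at once, which is what makes the sampling non-autoregressive.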