MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion
Onkar Susladkar, Jishu Sen Gupta, Chirag Sehgal, Sparsh Mittal, Rekha Singhal
TL;DR
MotionAura targets high-quality, temporally coherent video generation by combining a 3D-MBQ-VAE, which discretizes video into tokens, with an FFT-based spectral transformer denoiser that performs non-autoregressive discrete diffusion. It also introduces sketch-guided video inpainting, using LoRA for parameter-efficient fine-tuning, and reports SOTA performance on text-conditioned video generation and sketch-guided inpainting across multiple benchmarks. Key innovations include full-frame masking during VAE pretraining, discrete diffusion in latent space, Fourier-domain attention with RoPE, and conditioning on rich captions and sketches. The authors report higher generation quality, longer videos, and efficient conditioning, and plan to release code, datasets, and models as open source. Overall, MotionAura unifies discrete latent representations, frequency-domain denoising, and user-guided content manipulation for spatiotemporal video modeling.
Abstract
The spatiotemporal complexity of video data presents significant challenges in tasks such as compression, generation, and inpainting. We present four key contributions that address these challenges. First, we introduce the 3D Mobile Inverted Vector-Quantization Variational Autoencoder (3D-MBQ-VAE), which combines Variational Autoencoders (VAEs) with masked token modeling to enhance spatiotemporal video compression. The model achieves superior temporal consistency and state-of-the-art (SOTA) reconstruction quality by employing a novel training strategy with full-frame masking. Second, we present MotionAura, a text-to-video generation framework that utilizes vector-quantized diffusion models to discretize the latent space and capture complex motion dynamics, producing temporally coherent videos aligned with text prompts. Third, we propose a spectral transformer-based denoising network that processes video data in the frequency domain using the Fourier Transform. This method effectively captures global context and long-range dependencies for high-quality video generation and denoising. Lastly, we introduce a downstream task, Sketch-Guided Video Inpainting, which leverages Low-Rank Adaptation (LoRA) for parameter-efficient fine-tuning. Our models achieve SOTA performance on a range of benchmarks. Our work offers robust frameworks for spatiotemporal modeling and user-driven video content manipulation. We will release the code, datasets, and models as open source.
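
The abstract does not describe the internals of the spectral denoiser, so the following is only a minimal sketch of why frequency-domain processing captures global context. It uses FNet-style Fourier mixing, a well-known technique swapped in for illustration, and is not the authors' architecture. The function name `fourier_mix` and the token and channel sizes are assumptions made for this example.

```python
import numpy as np


def fourier_mix(tokens: np.ndarray) -> np.ndarray:
    """FNet-style mixing (illustrative, not the paper's layer):
    2D FFT over the (token, channel) axes, keeping the real part."""
    return np.fft.fft2(tokens, axes=(-2, -1)).real


rng = np.random.default_rng(0)

# Hypothetical sizes: 16 latent video tokens with 8 channels each.
tokens = rng.standard_normal((16, 8))
mixed = fourier_mix(tokens)

# Perturb a single token and measure which output rows change.
perturbed = tokens.copy()
perturbed[3] += rng.standard_normal(8)
delta = np.abs(fourier_mix(perturbed) - mixed)
rows_changed = int((delta.sum(axis=1) > 1e-9).sum())

print(f"output rows affected by one perturbed token: {rows_changed} of {tokens.shape[0]}")
```

Because each Fourier coefficient is a weighted sum over every input token, a change to one token propagates to all output positions in a single layer. This is the sense in which a frequency-domain operation has a global receptive field, in contrast to a local convolution.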
