Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation
Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan
TL;DR
Divot is a diffusion-powered video tokenizer that learns continuous spatiotemporal video representations in a self-supervised fashion, enabling unified video comprehension and generation when paired with a large language model. The tokenizer combines a ViT encoder, a Spatial-Temporal Transformer, and a Perceiver Resampler to produce 64 tokens per two-second clip; these tokens condition a diffusion-based de-tokenizer that reconstructs realistic videos. For generation, the authors model the distribution of continuous video features with a Gaussian Mixture Model (GMM) on top of a pre-trained LLM (Divot-LLM), so that videos can be sampled probabilistically from text prompts. Both autoregressive and query-based configurations are explored, with the query-based GMM approach performing best. Pre-training uses 4.8M video-caption pairs from WebVid-10M plus image-text data, followed by multimodal instruction tuning and a domain-specific fine-tuning stage for video storytelling. Experiments show competitive performance on video comprehension benchmarks and competitive text-to-video generation with relatively modest training data, and qualitative results demonstrate temporally coherent storytelling. The work advances unified video understanding and generation by combining continuous representations with diffusion-based representation learning.
Abstract
In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and whose representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-Vicuna through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-Vicuna also excels in video storytelling, generating interleaved narratives and corresponding videos.
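
To make the idea of modeling continuous-valued features with a Gaussian Mixture Model more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a diagonal-covariance mixture whose parameters (mixture logits, means, log-variances) would be predicted by the LLM for each video token. The function names `gmm_nll` and `gmm_sample`, the number of components `k`, and the feature dimension `d` are all hypothetical choices for illustration. Training would minimize the negative log-likelihood of the target feature, and generation would draw a sample from the predicted mixture.

```python
import numpy as np


def gmm_nll(target, logits, means, log_vars):
    """Negative log-likelihood of one feature vector under a diagonal GMM.

    target:   (d,)    continuous feature of one video token
    logits:   (k,)    unnormalized mixture weights (assumed LLM output)
    means:    (k, d)  component means (assumed LLM output)
    log_vars: (k, d)  component log-variances (assumed LLM output)
    """
    # Log-softmax turns the logits into log mixture weights.
    log_w = logits - np.logaddexp.reduce(logits)
    # Log-density of the target under each diagonal Gaussian component.
    log_pdf = -0.5 * np.sum(
        log_vars + (target - means) ** 2 / np.exp(log_vars) + np.log(2.0 * np.pi),
        axis=1,
    )
    # Log-sum-exp over components gives the mixture log-likelihood.
    return -np.logaddexp.reduce(log_w + log_pdf)


def gmm_sample(rng, logits, means, log_vars):
    """Draw one feature vector: pick a component, then sample its Gaussian."""
    weights = np.exp(logits - np.logaddexp.reduce(logits))
    j = rng.choice(len(weights), p=weights)
    std = np.exp(0.5 * log_vars[j])
    return means[j] + std * rng.standard_normal(means.shape[1])


# Illustrative usage with random stand-ins for the predicted parameters.
rng = np.random.default_rng(0)
k, d = 4, 8  # hypothetical number of components and feature dimension
logits = rng.standard_normal(k)
means = rng.standard_normal((k, d))
log_vars = np.zeros((k, d))  # unit variance for every component

feature = gmm_sample(rng, logits, means, log_vars)
loss = gmm_nll(feature, logits, means, log_vars)
```

Sampling from a predicted distribution, rather than regressing a single feature vector, is what allows the same text prompt to yield different videos. The sampled features would then be passed to the diffusion de-tokenizer as its condition.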
