Divot: Diffusion Powers Video Tokenizer for Comprehension and Generation

Yuying Ge, Yizhuo Li, Yixiao Ge, Ying Shan

TL;DR

Divot is a diffusion-powered video tokenizer that learns continuous spatiotemporal video representations in a self-supervised fashion, enabling unified video comprehension and generation when paired with a large language model. The tokenizer combines a ViT encoder, a Spatial-Temporal Transformer, and a Perceiver Resampler to produce 64 tokens per two-second clip; these tokens condition a diffusion-based de-tokenizer that reconstructs realistic videos. For generation, the authors model the distribution of the continuous video features with a Gaussian Mixture Model (GMM) whose parameters are predicted by a pre-trained LLM (Divot-LLM), so that video features can be sampled conditioned on a text prompt. Autoregressive and query-based configurations are explored, with the query-based GMM approach performing best. Pre-training uses 4.8M video-caption pairs from WebVid-10M plus image-text data, followed by multimodal instruction tuning and a domain-specific fine-tuning stage for video storytelling. Experiments show competitive performance on video comprehension benchmarks and on text-to-video generation with relatively modest training data, and qualitative results demonstrate temporally coherent storytelling. The work advances unified video understanding and generation by combining continuous representations with diffusion-based representation learning, an approach applicable to future video-capable LLMs.

Abstract

In recent years, there has been a significant surge of interest in unifying image comprehension and generation within Large Language Models (LLMs). This growing interest has prompted us to explore extending this unification to videos. The core challenge lies in developing a versatile video tokenizer that captures both the spatial characteristics and temporal dynamics of videos to obtain representations for LLMs, and whose representations can be further decoded into realistic video clips to enable video generation. In this work, we introduce Divot, a Diffusion-Powered Video Tokenizer, which leverages the diffusion process for self-supervised video representation learning. We posit that if a video diffusion model can effectively de-noise video clips by taking the features of a video tokenizer as the condition, then the tokenizer has successfully captured robust spatial and temporal information. Additionally, the video diffusion model inherently functions as a de-tokenizer, decoding videos from their representations. Building upon the Divot tokenizer, we present Divot-LLM through video-to-text autoregression and text-to-video generation by modeling the distributions of continuous-valued Divot features with a Gaussian Mixture Model. Experimental results demonstrate that our diffusion-based video tokenizer, when integrated with a pre-trained LLM, achieves competitive performance across various video comprehension and generation benchmarks. The instruction-tuned Divot-LLM also excels in video storytelling, generating interleaved narratives and corresponding videos.

Paper Structure

This paper contains 23 sections, 2 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: We utilize the diffusion procedure to learn a video tokenizer in a self-supervised manner for unified comprehension and generation, where the spatiotemporal representations serve as the condition of a diffusion model to de-noise video clips. Additionally, the proxy diffusion model functions as a de-tokenizer to decode realistic video clips from the video representations.
  • Figure 2: Overview of Divot tokenization and de-tokenization. During training, sparsely sampled video frames are fed into the tokenizer to obtain spatiotemporal representations. These representations serve as the conditions for a U-Net, which is trained to de-noise the noisy VAE latents of densely sampled video frames. During inference, the video representations from the Divot tokenizer can be decoded into realistic video clips with the U-Net.
  • Figure 3: Overview of Divot-LLM. Video features from the Divot tokenizer are fed into the LLM to perform next-word prediction for video comprehension, while learnable queries are input into the LLM to model the distributions of Divot features using a Gaussian Mixture Model (GMM) for video generation. During inference, video features are sampled from the predicted GMM distribution to decode videos using the de-tokenizer.
  • Figure 4: Paradigms for modeling video representations from the Divot tokenizer with an LLM for video generation. (a) MSE Regression, where the LLM output is trained to minimize its distance with video features using Mean Squared Error (MSE) loss; (b) Diffusion Modeling, where the LLM output is fed into a denoising network as the condition to predict the noise added to video features; (c) GMM Modeling, where the LLM output is trained to predict the parameters of a Gaussian Mixture Model (GMM) for modeling video feature distributions.
  • Figure 5: Reconstructed videos, where the Divot tokenizer obtains spatiotemporal representations of sparsely sampled video frames and the de-tokenizer decodes these representations into semantically aligned and temporally coherent video clips.
  • ...and 5 more figures
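
The GMM modeling paradigm described in the captions of Figures 3 and 4(c) can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes a diagonal-covariance mixture and a flat parameter layout (mixture logits, then means, then log-variances) per video token. The function names, the number of components `k`, and the feature dimension `d` are hypothetical; only the count of 64 tokens per clip comes from the summary above. Training would minimize the negative log-likelihood of the Divot features, and inference would sample features from the predicted mixture before passing them to the de-tokenizer.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def split_gmm_params(raw, k, d):
    """Split raw predictions of shape (tokens, k + 2*k*d) into GMM parameters.

    The layout (logits, means, log-variances) is an assumption of this sketch.
    """
    n = raw.shape[0]
    assert raw.shape[1] == k + 2 * k * d, "unexpected parameter width"
    logits = raw[:, :k]                             # mixture logits, (tokens, k)
    means = raw[:, k:k + k * d].reshape(n, k, d)    # component means
    log_vars = raw[:, k + k * d:].reshape(n, k, d)  # diagonal log-variances
    return logits, means, log_vars


def gmm_nll(raw, target, k):
    """Mean negative log-likelihood of target features (tokens, d) under the GMM."""
    d = target.shape[1]
    logits, means, log_vars = split_gmm_params(raw, k, d)
    # Log-softmax turns logits into log mixture weights.
    log_weights = logits - logsumexp(logits, axis=1)[:, None]
    diff = target[:, None, :] - means
    # Log-density of each diagonal Gaussian component, shape (tokens, k).
    log_prob = -0.5 * np.sum(
        diff ** 2 / np.exp(log_vars) + log_vars + np.log(2.0 * np.pi), axis=2
    )
    return float(-np.mean(logsumexp(log_weights + log_prob, axis=1)))


def gmm_sample(raw, k, d, rng):
    """Draw one feature vector per token from the predicted mixture."""
    logits, means, log_vars = split_gmm_params(raw, k, d)
    # Gumbel-max trick: argmax(logits + Gumbel noise) is a categorical sample.
    comp = np.argmax(logits + rng.gumbel(size=logits.shape), axis=1)
    rows = np.arange(raw.shape[0])
    mu = means[rows, comp]
    sigma = np.exp(0.5 * log_vars[rows, comp])
    return mu + sigma * rng.standard_normal(mu.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, k, d = 64, 4, 8  # 64 tokens per clip; k and d are illustrative only
    raw = rng.normal(size=(tokens, k + 2 * k * d))  # stand-in for the LLM output
    target = rng.normal(size=(tokens, d))           # stand-in for Divot features
    print("NLL:", gmm_nll(raw, target, k))
    print("sample shape:", gmm_sample(raw, k, d, rng).shape)
```

Compared with the MSE regression in Figure 4(a), which yields a single point estimate per prompt, a mixture likelihood lets the model place probability mass on several plausible feature vectors, which is what makes sampling at inference time meaningful.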