HiTVideo: Hierarchical Tokenizers for Enhancing Text-to-Video Generation with Autoregressive Large Language Models

Ziqin Zhou, Yifan Yang, Yuqing Yang, Tianyu He, Houwen Peng, Kai Qiu, Qi Dai, Lili Qiu, Chong Luo, Lingqiao Liu

TL;DR

HiTVideo tackles text-to-video generation by introducing hierarchical, multi-layer discrete video tokens encoded via a 3D causal VAE to capture semantic content and fine-grained spatiotemporal details across long sequences (64 frames at 8 FPS). The tokens feed an autoregressive LLM (Llama-3B) conditioned on text embeddings (Flan-T5-XL), with 3D RoPE position encodings and classifier-free guidance to bridge language and vision. The approach achieves about a 70% reduction in bits-per-pixel while maintaining reconstruction quality, and improves text-to-video alignment, enabling coherent long-sequence generation with a simpler modeling requirement for the LLM. This scalable framework advances text-to-video generation by balancing compression, reconstruction, and semantic alignment, and offers a pathway for integrating hierarchical tokenizers with diffusion or multi-modal systems in the future.

Abstract

Text-to-video generation poses significant challenges due to the inherent complexity of video data, which spans both temporal and spatial dimensions. Video data introduces additional redundancy and abrupt variations, along with a domain gap between language and vision tokens during generation. Addressing these challenges requires an effective video tokenizer that can efficiently encode video data while preserving essential semantic and spatiotemporal information, serving as a critical bridge between text and vision. Inspired by observations from VQ-VAE-2 and by the workflows of traditional animation, we propose HiTVideo for text-to-video generation with hierarchical tokenizers. It utilizes a 3D causal VAE with a multi-layer discrete token framework, encoding video content into hierarchically structured codebooks. Higher layers capture semantic information with higher compression, while lower layers focus on fine-grained spatiotemporal details, striking a balance between compression efficiency and reconstruction quality. Our approach efficiently encodes longer video sequences (e.g., 8 seconds, 64 frames), reducing bits per pixel (bpp) by approximately 70% compared to baseline tokenizers, while maintaining competitive reconstruction quality. We explore the trade-offs between compression and reconstruction, while emphasizing the advantages of highly compressed semantic tokens in text-to-video tasks. HiTVideo aims to address the potential limitations of existing video tokenizers in text-to-video generation tasks, striving for higher compression ratios and simpler LLM modeling under language guidance, offering a scalable and promising framework for advancing text-to-video generation. Demo page: https://ziqinzhou66.github.io/project/HiTVideo.
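
To make the bits-per-pixel (bpp) claim concrete, the following is a minimal sketch of how bpp can be computed for a discrete video tokenizer: the total number of token bits divided by the number of pixels in the clip. The function name `bits_per_pixel`, the token grid shapes, and the codebook size are all hypothetical values chosen only for illustration. They are not the configuration reported for HiTVideo, and the sketch assumes every layer shares one codebook size and that bpp is counted per pixel rather than per color channel.

```python
import math


def bits_per_pixel(token_grid_shapes, codebook_size, frames, height, width):
    """Return bpp for a clip encoded as one or more discrete token grids.

    token_grid_shapes: list of (t, h, w) token grid shapes, one per layer.
    codebook_size: number of codebook entries (assumed equal for all layers).
    """
    total_tokens = sum(t * h * w for (t, h, w) in token_grid_shapes)
    # Each token index costs log2(codebook_size) bits.
    total_bits = total_tokens * math.log2(codebook_size)
    return total_bits / (frames * height * width)


# Hypothetical single-layer baseline: 4x temporal and 8x spatial downsampling
# of a 64 x 256 x 256 clip.
baseline = bits_per_pixel([(16, 32, 32)], 8192, 64, 256, 256)

# Hypothetical two-layer hierarchy: a small, highly compressed semantic layer
# plus a finer detail layer.
hierarchical = bits_per_pixel([(8, 8, 8), (16, 16, 16)], 8192, 64, 256, 256)

# Relative bpp reduction of the hierarchy over the baseline.
reduction = 1 - hierarchical / baseline
print(f"baseline bpp={baseline:.5f}, hierarchical bpp={hierarchical:.5f}, "
      f"reduction={reduction:.1%}")
```

With a shared codebook size, the reduction depends only on the ratio of total token counts, so shrinking the token grids is what lowers bpp in this sketch.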

Paper Structure

This paper contains 15 sections, 2 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: The overall architecture of HiTVideo tokenizer.
  • Figure 2: Qualitative evaluation of our proposed hierarchical tokenizers. In each case, the left column shows frames sampled from the 64-frame input video, while the right column presents the corresponding reconstruction results.
  • Figure 3: Qualitative generative results of our HiTVideo tokenizers.
  • Figure 4: Comparison of single-layer and multi-layer video tokenizers at $64\times256\times256$ video resolution.
  • Figure 5: Comparison of qualitative and quantitative results across different input resolutions with the same total number of video tokens.
  • ...and 6 more figures