Efficient Long Video Tokenization via Coordinate-based Patch Reconstruction

Huiwon Jang, Sihyun Yu, Jinwoo Shin, Pieter Abbeel, Younggyo Seo

TL;DR

CoordTok is a scalable video tokenizer that encodes long videos by learning a coordinate-based mapping from randomly sampled $(x,y,t)$ coordinates to patches, using factorized triplane latents ${\mathbf{z}}=[{\mathbf{z}}^{xy}, {\mathbf{z}}^{yt}, {\mathbf{z}}^{xt}]$. The encoder produces these three 2D planes; the decoder queries them at the sampled coordinates via bilinear interpolation and applies self-attention to fuse information across coordinates for patch reconstruction. The model is trained with an $\ell_2$ loss and optional LPIPS fine-tuning. Empirically, CoordTok compresses tokens substantially (e.g., a 128-frame video at $128\times128$ can be encoded in roughly 1280 tokens, versus 6144–8192 for baselines) and enables memory-efficient long-video generation with diffusion transformers, achieving state-of-the-art FVD on 128-frame videos. Analyses examine the effects of model size, triplane resolution, coordinate representations, and sampling strategies. The authors note limitations on highly dynamic content and suggest future improvements such as multiple content planes and adaptive encoding. Overall, the work provides a practical path toward scalable long-context video tokens and more efficient long-video synthesis and understanding.
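To put the token counts quoted above in perspective, the short calculation below works out the reduction factors they imply. It is only arithmetic on the figures stated in the text; the helper name `reduction_factor` and the assumption of 3 RGB channels per pixel are illustrative choices, not something the paper defines.

```python
def reduction_factor(baseline_tokens: int, coordtok_tokens: int) -> float:
    """How many times fewer tokens CoordTok uses than a baseline."""
    return baseline_tokens / coordtok_tokens


COORDTOK_TOKENS = 1280            # stated for a 128-frame, 128x128 video
BASELINE_TOKENS = (6144, 8192)    # stated range for the baselines

for baseline in BASELINE_TOKENS:
    factor = reduction_factor(baseline, COORDTOK_TOKENS)
    print(f"{baseline} / {COORDTOK_TOKENS} = {factor:.1f}x fewer tokens")

# Raw input size, assuming 3 RGB channels per pixel (an assumption).
frames, height, width, channels = 128, 128, 128, 3
raw_values = frames * height * width * channels
print(f"raw pixel values: {raw_values}")
print(f"pixel values per token: {raw_values / COORDTOK_TOKENS:.1f}")
```

Under these numbers, the stated token counts correspond to roughly a 4.8x to 6.4x reduction relative to the baselines.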

Abstract

Efficient tokenization of videos remains a challenge in training vision models that can process long videos. One promising direction is to develop a tokenizer that can encode long video clips, as it would enable the tokenizer to leverage the temporal coherence of videos better for tokenization. However, training existing tokenizers on long videos often incurs a huge training cost as they are trained to reconstruct all the frames at once. In this paper, we introduce CoordTok, a video tokenizer that learns a mapping from coordinate-based representations to the corresponding patches of input videos, inspired by recent advances in 3D generative models. In particular, CoordTok encodes a video into factorized triplane representations and reconstructs patches that correspond to randomly sampled $(x,y,t)$ coordinates. This allows for training large tokenizer models directly on long videos without requiring excessive training resources. Our experiments show that CoordTok can drastically reduce the number of tokens for encoding long video clips. For instance, CoordTok can encode a 128-frame video with 128$\times$128 resolution into 1280 tokens, while baselines need 6144 or 8192 tokens to achieve similar reconstruction quality. We further show that this efficient video tokenization enables memory-efficient training of a diffusion transformer that can generate 128 frames at once.
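The coordinate-based querying described above can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the plane sizes, the channel count, the random stand-in for encoder outputs, and the choice to concatenate the three per-plane features are all assumptions made for illustration, and the function names (`make_plane`, `bilinear`, `query_triplane`) are hypothetical.

```python
import random


def make_plane(h, w, c, rng):
    """Random (h, w, c) latent plane as nested lists; a stand-in for encoder output."""
    return [[[rng.uniform(-1, 1) for _ in range(c)] for _ in range(w)] for _ in range(h)]


def bilinear(plane, u, v):
    """Bilinearly sample a plane at normalized (u, v) in [0, 1].

    u runs along the first (row) axis, v along the second (column) axis.
    """
    h, w = len(plane), len(plane[0])
    r, c = u * (h - 1), v * (w - 1)
    r0, c0 = int(r), int(c)
    r1, c1 = min(r0 + 1, h - 1), min(c0 + 1, w - 1)
    fr, fc = r - r0, c - c0
    return [
        (1 - fr) * (1 - fc) * plane[r0][c0][k]
        + (1 - fr) * fc * plane[r0][c1][k]
        + fr * (1 - fc) * plane[r1][c0][k]
        + fr * fc * plane[r1][c1][k]
        for k in range(len(plane[0][0]))
    ]


def query_triplane(z_xy, z_yt, z_xt, x, y, t):
    """Features for one normalized (x, y, t) coordinate.

    Concatenating the three per-plane features is an assumption of this sketch.
    """
    return bilinear(z_xy, x, y) + bilinear(z_yt, y, t) + bilinear(z_xt, x, t)


rng = random.Random(0)
C = 4                                  # hypothetical channel count
z_xy = make_plane(4, 4, C, rng)        # hypothetical plane sizes
z_yt = make_plane(4, 8, C, rng)
z_xt = make_plane(4, 8, C, rng)

# Sample N random coordinates instead of touching every position in the video.
N = 5
coords = [(rng.random(), rng.random(), rng.random()) for _ in range(N)]
features = [query_triplane(z_xy, z_yt, z_xt, x, y, t) for x, y, t in coords]
print(len(features), len(features[0]))
```

The point of the sketch is that the cost of one training step scales with the number of sampled coordinates `N` rather than with the full video volume, which is what would let a decoder be trained on long clips without reconstructing every frame at once.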

Paper Structure

This paper contains 51 sections, 2 equations, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Limitation of existing video tokenizers. (a) Existing video tokenizers [ge2022long, yu2023video, wang2024larp] often do not scale to long videos because of excessive memory and computational demands. They are trained to reconstruct all video frames at once, i.e., a giant 3D array of pixels, which imposes a heavy computation and memory burden, especially on long videos. For instance, PVDM-AE [yu2023video] runs out of memory when trained to encode 128-frame videos on a single NVIDIA 4090 24GB GPU. (b) As a result, existing tokenizers are typically trained to encode videos of up to 16 frames and struggle to capture the temporal coherence of videos.
  • Figure 2: Overview of CoordTok. We design our encoder to encode a video ${\mathbf{x}}$ into factorized triplane representations ${\mathbf{z}} = [{\mathbf{z}}^{xy}, {\mathbf{z}}^{yt}, {\mathbf{z}}^{xt}]$, which can efficiently represent the video with three 2D latent planes. Given the triplane representations $\mathbf{z}$, our decoder learns a mapping from $(x,y,t)$ coordinates to RGB pixels within the corresponding patches. In particular, we extract coordinate-based representations of $N$ sampled coordinates by querying the coordinates from the triplane representations via bilinear interpolation. The decoder then aggregates and fuses information from different coordinates with self-attention layers and projects the outputs into the corresponding patches. This design enables us to train tokenizers on long videos in a compute-efficient manner by avoiding reconstruction of entire frames at once.
  • Figure 2: FVDs of video generation models on the UCF-101 dataset (128-frame, 128$\times$128 resolution). $\downarrow$ indicates lower values are better.
  • Figure 3: 128-frame, 128$\times$128 resolution video reconstruction results from CoordTok (Ours) and baselines [yu2023video, wang2024larp] trained on the UCF-101 dataset [soomro2012ucf101]. For each frame, we visualize the ground-truth (GT) and reconstructed pixels within the region highlighted in the red box, where CoordTok achieves noticeably better reconstruction quality than other baselines.
  • Figure 4: CoordTok can efficiently encode long videos. rFVD scores of video tokenizers, evaluated on 128-frame videos, with respect to the token size. $\downarrow$ indicates lower values are better.
  • ...and 9 more figures