VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents
Feng Wang, Yichun Shi, Ceyuan Yang, Qiushan Guo, Jingxiang Sun, Alan Yuille, Peng Wang
TL;DR
VTok introduces a unified video tokenizer that explicitly decouples spatial and temporal representations: it encodes the first frame's spatial content as a set of tokens and summarizes each subsequent frame with a residual motion token, reducing the token budget from $O(T \times S)$ to $O(T+S)$ for $T$ frames and $S$ tokens per frame. The tokenizer is embedded in a single autoregressive multimodal language model framework that handles both video understanding and text-to-video generation, with a diffusion-based video decoder that reconstructs or generates the visuals. Empirical results on TV-Align, VBench, and multiple video-understanding benchmarks show improved temporal fidelity and semantic alignment with shorter token sequences than frame-sampling baselines. The approach generalizes across architectures (e.g., via a lightweight adapter to WAN2.2) and offers a standardized, efficient paradigm for future video-language modeling.
Abstract
This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
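The reduction from a product to a sum can be made concrete with a small token-count sketch. It assumes, following the description above, that the key frame keeps all of its spatial tokens and every later frame contributes exactly one residual token. The function names and the example values (16 frames, 256 tokens per frame) are hypothetical and are not taken from the paper.

```python
def naive_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Frame-sampling baseline: every frame contributes a full set of spatial tokens."""
    if num_frames < 1 or tokens_per_frame < 1:
        raise ValueError("num_frames and tokens_per_frame must be positive")
    return num_frames * tokens_per_frame


def decoupled_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Decoupled scheme: spatial tokens for the key frame only,
    plus one residual token for each subsequent frame (an assumption
    based on the abstract's description)."""
    if num_frames < 1 or tokens_per_frame < 1:
        raise ValueError("num_frames and tokens_per_frame must be positive")
    return tokens_per_frame + (num_frames - 1)


# Hypothetical example values, chosen only for illustration.
T, S = 16, 256
naive = naive_token_count(T, S)          # T * S
decoupled = decoupled_token_count(T, S)  # S + (T - 1)
print(f"naive: {naive} tokens, decoupled: {decoupled} tokens")
```

With these illustrative values the naive sequence has 4096 tokens and the decoupled one has 271, and the gap widens as the number of frames grows because only the additive term depends on $T$.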
