VTok: A Unified Video Tokenizer with Decoupled Spatial-Temporal Latents

Feng Wang, Yichun Shi, Ceyuan Yang, Qiushan Guo, Jingxiang Sun, Alan Yuille, Peng Wang

TL;DR

VTok introduces a unified video tokenizer that explicitly decouples spatial and temporal representations: the first frame's spatial content is encoded as a set of tokens, and each subsequent frame is summarized by a residual motion token. This reduces the token budget from $O(T \times S)$ to $O(T+S)$, where $T$ is the frame count and $S$ is the per-frame token count. The tokenizer is embedded in a single autoregressive multimodal language model framework that handles both video understanding and text-to-video generation, with a diffusion-based video decoder to reconstruct or generate visuals. Empirical results on TV-Align, VBench, and multiple video-understanding benchmarks demonstrate improved temporal fidelity and semantic alignment, while using shorter token sequences than frame-sampling baselines. The approach generalizes across architectures (e.g., via a lightweight adapter to WAN2.2) and offers a standardized, efficient paradigm for future video-language modeling.

Abstract

This work presents VTok, a unified video tokenization framework that can be used for both generation and understanding tasks. Unlike the leading vision-language systems that tokenize videos through a naive frame-sampling strategy, we propose to decouple the spatial and temporal representations of videos by retaining the spatial features of a single key frame while encoding each subsequent frame into a single residual token, achieving compact yet expressive video tokenization. Our experiments suggest that VTok effectively reduces the complexity of video representation from the product of frame count and per-frame token count to their sum, while the residual tokens sufficiently capture viewpoint and motion changes relative to the key frame. Extensive evaluations demonstrate the efficacy and efficiency of VTok: it achieves notably higher performance on a range of video understanding and text-to-video generation benchmarks compared with baselines using naive tokenization, all with shorter token sequences per video (e.g., 3.4% higher accuracy on our TV-Align benchmark and 1.9% higher VBench score). Remarkably, VTok produces more coherent motion and stronger guidance following in text-to-video generation, owing to its more consistent temporal encoding. We hope VTok can serve as a standardized video tokenization paradigm for future research in video understanding and generation.
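
The product-to-sum reduction described in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only: the function names and the example values (16 frames, 256 tokens per frame) are assumptions chosen for the demonstration and are not taken from the paper. It assumes, as the abstract states, one full set of spatial tokens for the key frame plus one residual token for each subsequent frame.

```python
def naive_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Frame-sampling baseline: every frame contributes a full set of spatial tokens."""
    if num_frames < 1 or tokens_per_frame < 1:
        raise ValueError("num_frames and tokens_per_frame must be positive")
    return num_frames * tokens_per_frame


def decoupled_token_count(num_frames: int, tokens_per_frame: int) -> int:
    """Decoupled scheme: spatial tokens for the key frame only,
    plus one residual token for each of the remaining frames."""
    if num_frames < 1 or tokens_per_frame < 1:
        raise ValueError("num_frames and tokens_per_frame must be positive")
    return tokens_per_frame + (num_frames - 1)


if __name__ == "__main__":
    # Hypothetical example values, not reported in the paper.
    frames, per_frame = 16, 256
    naive = naive_token_count(frames, per_frame)
    decoupled = decoupled_token_count(frames, per_frame)
    print(f"naive: {naive} tokens, decoupled: {decoupled} tokens")
```

Under these assumed values the baseline needs 4096 tokens and the decoupled scheme 271; the gap widens as either the frame count or the per-frame token count grows.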

Paper Structure (13 sections, 17 equations, 3 figures, 5 tables)

This paper contains 13 sections, 17 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Video tokenization strategies. Leading VLMs such as Qwen2.5-VL [qwen25-vl] still rely on a naïve frame-sampling approach to process videos. We contend that this strategy introduces excessive redundancy in spatial information while omitting temporal details. We find that video tokenization can be decomposed into spatial and temporal components, which preserves spatiotemporal information while minimizing the overall token length.
  • Figure 2: Video generation examples. We showcase the video generation results of the state-of-the-art open-source model WAN2.2 [wan] on our text-to-video alignment benchmark. By integrating VTok, the model demonstrates a more precise understanding of textual guidance, showing noticeably improved accuracy in following attributes such as object count, motion direction, position, and size.
  • Figure 3: Our unified framework of video understanding and text-to-video generation. The model integrates video understanding and generation within a single autoregressive multimodal large language model (MLLM). In the understanding branch, videos are first tokenized into key-frame and residual tokens, aligned with the language space, and processed together with a textual prompt to produce semantic outputs such as captions or answers. In the generation branch, the MLLM samples visual tokens conditioned on text, following the same spatial–temporal format, and a diffusion transformer can decode these tokens back to a video.
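
The key-frame-plus-residual token format that Figure 3 describes for both branches can be sketched as a simple sequence layout. This is a minimal illustration, not the authors' implementation: the placeholder strings (`<key_i>`, `<res_t>`), the function names, and the choice to place video tokens before the text prompt are all assumptions, since the captions do not specify the ordering or the token vocabulary.

```python
from typing import List, Sequence


def build_video_token_layout(num_frames: int, tokens_per_key_frame: int) -> List[str]:
    """Return placeholder tokens: spatial tokens for the key frame,
    followed by one residual token per subsequent frame."""
    if num_frames < 1 or tokens_per_key_frame < 1:
        raise ValueError("num_frames and tokens_per_key_frame must be positive")
    key_tokens = [f"<key_{i}>" for i in range(tokens_per_key_frame)]
    residual_tokens = [f"<res_{t}>" for t in range(1, num_frames)]
    return key_tokens + residual_tokens


def build_understanding_input(
    prompt_tokens: Sequence[str], num_frames: int, tokens_per_key_frame: int
) -> List[str]:
    """Concatenate video placeholders with the text prompt.
    Assumption: video tokens come first; the source does not state the order."""
    return build_video_token_layout(num_frames, tokens_per_key_frame) + list(prompt_tokens)


if __name__ == "__main__":
    # Hypothetical sizes, kept tiny so the layout is easy to read.
    sequence = build_understanding_input(["Describe", "the", "video"], 4, 4)
    print(sequence)
```

In the generation branch the same layout would be the target the language model samples, conditioned on text, before a diffusion decoder maps the tokens back to frames; that decoding step is outside the scope of this sketch.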