SweetTok: Semantic-Aware Spatial-Temporal Tokenizer for Compact Video Discretization

Zhentao Tan, Ben Xue, Jian Jia, Junhao Wang, Wencai Ye, Shaoyun Shi, Mingjie Sun, Wenjin Wu, Quan Chen, Peng Jiang

TL;DR

SweetTok addresses the inefficiency of video tokenization by decoupling spatial and temporal compression through a Decoupled Query AutoEncoder and augmenting semantic fidelity with a Motion-enhanced Language Codebook. The method achieves high reconstruction fidelity at a notably reduced token count and demonstrates strong performance in video generation and few-shot semantic tasks via LLMs. Across UCF-101 and K-600, SweetTok outperforms baselines in rFVD and gFVD, while enabling image reconstruction improvements and semantic understanding, suggesting practical utility for compact video representations in generation and recognition pipelines.

Abstract

This paper presents the Semantic-aWarE spatial-tEmporal Tokenizer (SweetTok), a novel video tokenizer that overcomes the limitations of current video tokenization methods to achieve compact yet effective discretization. Unlike previous approaches that process flattened local visual patches via direct discretization or adaptive query tokenization, SweetTok proposes a decoupling framework, compressing visual inputs through distinct spatial and temporal queries via a Decoupled Query AutoEncoder (DQAE). This design allows SweetTok to efficiently reduce the video token count while achieving superior fidelity by capturing essential information across spatial and temporal dimensions. Furthermore, we design a Motion-enhanced Language Codebook (MLC) tailored for spatial and temporal compression to address the differences in semantic representation between appearance and motion information. SweetTok significantly improves video reconstruction results by 42.8% w.r.t. rFVD on the UCF-101 dataset. With a better token compression strategy, it also boosts downstream video generation results by 15.1% w.r.t. gFVD. Additionally, the compressed decoupled tokens are imbued with semantic information, enabling few-shot recognition capabilities powered by LLMs in downstream applications.

Paper Structure

This paper contains 39 sections, 10 equations, 20 figures, 9 tables.

Figures (20)

  • Figure 1: Illustration of our framework. We build a compact visual latent space by reducing token count in a decoupled style and leveraging motion-enhanced semantic text embedding. The encoded tokens can be applied to downstream tasks, such as generation and understanding.
  • Figure 2: Pipeline overview. (a) Vanilla video tokenizers directly quantize flattened video patches. (b) Vanilla query-based tokenizers compress flattened video patches into adaptive queries. (c) SweetTok proposes a decoupled query-based autoencoder (DQAE, Section 3.1). The spatial encoder quantizes the first frame's patch embeddings, while the temporal encoder quantizes the residual between consecutive frames. The spatial decoder reconstructs the first frame's patches, replicates them $T$ times, and passes them to the temporal decoder for final information fusion and reconstruction. SweetTok also proposes a motion-enhanced language codebook (MLC, Section 3.2) to complement the reconstructed video information with action-related language semantics.
  • Figure 3: The semantics of spatial-temporal "words". The attention weights of the last encoder's cross-attention layer are visualized via heatmap, showing the visual regions corresponding to the related latent words.
  • Figure 7: Comparison of the reconstruction results of OmniTokenizer and SweetTok on the UCF-101 dataset, where "Diff" represents the pixel difference between the ground truth and each model's reconstruction.
  • Figure 8: Comparison of the reconstruction results of OmniTokenizer and SweetTok on the K-600 dataset, where "Diff" represents the pixel difference between the ground truth and each model's reconstruction.
  • ...and 15 more figures
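
The decoupling described in the Figure 2 caption (first-frame patches on the spatial path, consecutive-frame residuals on the temporal path, then replication of the first frame $T$ times before temporal fusion) can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the function names `split_spatial_temporal`, `merge_spatial_temporal`, and `quantize` are hypothetical, the learned query-based encoders and decoders are replaced by plain array arithmetic, and the nearest-neighbour lookup is a generic vector-quantization step standing in for the MLC.

```python
import numpy as np

def split_spatial_temporal(patches):
    """Split patch embeddings of shape (T, N, D) into two streams.

    Returns the first frame's patches (N, D) and the residuals between
    consecutive frames (T-1, N, D). Assumption: plain subtraction stands
    in for the learned spatial and temporal encoders.
    """
    spatial = patches[0]
    temporal = patches[1:] - patches[:-1]
    return spatial, temporal

def merge_spatial_temporal(spatial, temporal):
    """Replicate the first frame T times, then add accumulated residuals.

    Assumption: a cumulative sum stands in for the learned temporal
    decoder's fusion step.
    """
    num_frames = temporal.shape[0] + 1
    video = np.repeat(spatial[None], num_frames, axis=0)  # (T, N, D)
    video[1:] += np.cumsum(temporal, axis=0)
    return video

def quantize(vectors, codebook):
    """Generic nearest-neighbour vector quantization.

    vectors: (M, D), codebook: (K, D). Returns the chosen code indices
    and the corresponding codebook entries.
    """
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patches = rng.normal(size=(4, 6, 8))  # T=4 frames, N=6 patches, D=8
    spatial, temporal = split_spatial_temporal(patches)
    restored = merge_spatial_temporal(spatial, temporal)
    print(np.allclose(restored, patches))
```

Without quantization the split and merge are exact inverses, which makes the structure easy to check; in the actual tokenizer, each stream would be compressed and quantized separately, so reconstruction is lossy.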