TrajTok: Learning Trajectory Tokens enables better Video Understanding

Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna

TL;DR

TrajTok is an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration.

Abstract

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.
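
To make the claim about decoupling video duration from token count concrete, the following is a minimal arithmetic sketch, not the authors' code. The function names, the 16-pixel patch size, the 224x224 resolution, and the figure of 50 trajectories are all illustrative assumptions that do not come from the paper.

```python
def patch_token_count(num_frames, height, width, patch_size=16, tubelet_frames=1):
    """Tokens produced by plain space-time patchification (hypothetical helper).

    Each token covers `tubelet_frames` frames and a patch_size x patch_size
    region, so the count grows linearly with the number of frames.
    """
    return (num_frames // tubelet_frames) * (height // patch_size) * (width // patch_size)


def trajectory_token_count(num_trajectories, tokens_per_trajectory=1):
    """Tokens produced by a trajectory tokenizer (hypothetical helper).

    The count depends on how many trajectories the scene contains,
    not on how many frames the clip has.
    """
    return num_trajectories * tokens_per_trajectory


# Assumed example values: 224x224 frames, 16x16 patches, 50 trajectories.
short_clip = patch_token_count(num_frames=16, height=224, width=224)
long_clip = patch_token_count(num_frames=64, height=224, width=224)
traj_tokens = trajectory_token_count(num_trajectories=50)

print(short_clip, long_clip, traj_tokens)
```

Under these assumed numbers, quadrupling the clip length quadruples the patch token count, while the trajectory token count stays the same as long as the scene contains the same number of trajectories.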

Paper Structure (21 sections, 2 equations, 9 figures, 9 tables)

Figures (9)

  • Figure 1: (a) Traditional video tokenization splits a video into space-time patches, introducing a large number of redundant tokens. (b) Prior work [zheng2025trajvit] proposes representing a video via panoptic sub-object trajectories, which significantly reduces redundancy but relies on slow, non-differentiable pipelines. (c) We propose TrajTok, an end-to-end differentiable trajectory tokenizer that learns to implicitly propose trajectory tokens, offering low token counts, efficiency, and adaptability to downstream objectives.
  • Figure 2: Overview of the TrajTok architecture. TrajTok comprises a trajectory segmenter and a trajectory encoder. The segmenter proposes trajectory masks for all objects in an image or video within a single forward pass. The encoder then aggregates raw video pixels or encoded visual features (parameterized by $f$ in the figure) according to these masks to produce trajectory tokens. The number of tokens per trajectory can be flexibly adjusted based on the available compute budget.
  • Figure 3: Training with downstream understanding tasks reshapes the segmentation granularity. We visualize the trajectory masks produced by our segmenter when trained with only segmentation supervision versus jointly with segmentation and CLIP objectives. The CLIP objective reshapes the segmentation granularity, producing finer foreground object masks while merging background regions.
  • Figure 4: TrajTok is a versatile module applicable across pretraining, feature adaptation, and finetuning stages. We demonstrate its use in three scenarios: TrajViT2, which trains a visual encoder from scratch; TrajAdapter, which adapts pretrained features for downstream tasks; and TrajVLM, which uses TrajTok as a connector in LLaVA-style large vision–language models.
  • Figure 5: Scaling with video training data. TrajViT2 exhibits stronger scaling behavior than TrajViT and sustains a consistent performance margin over ViT3D at every data scale.
  • ...and 4 more figures
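
The Figure 2 caption describes an encoder that aggregates features according to trajectory masks. The sketch below illustrates that idea with a deliberately simple stand-in: a mask-weighted average that yields one token per trajectory. This pooling rule is an assumption chosen for clarity, not the paper's encoder, and the name `pool_by_masks` and the toy inputs are hypothetical.

```python
def pool_by_masks(features, masks):
    """Mask-weighted average pooling (illustrative stand-in, not the paper's encoder).

    features: list of N feature vectors, one per space-time position, each of length D.
    masks:    list of K soft masks, each a list of N non-negative weights.
    Returns one D-dimensional token per non-empty mask.
    """
    dim = len(features[0])
    tokens = []
    for mask in masks:
        total = sum(mask)
        if total == 0:
            continue  # an empty trajectory contributes no token
        token = [
            sum(w * f[d] for w, f in zip(mask, features)) / total
            for d in range(dim)
        ]
        tokens.append(token)
    return tokens


# Toy input: 4 space-time positions with 2-dimensional features,
# split between two trajectories by hard (0/1) masks.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
masks = [[1, 1, 0, 0], [0, 0, 1, 1]]
tokens = pool_by_masks(features, masks)
print(tokens)  # [[2.0, 0.0], [0.0, 3.0]]
```

Here the number of output tokens equals the number of non-empty masks, so it follows the number of trajectories rather than the number of input positions. Producing several tokens per trajectory, as the caption mentions, would need a richer aggregator than this single average.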