Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zenghui Ding, Xianjun Yang, Yining Sun

TL;DR

The paper presents DyTo, a training-free dynamic token merging framework for zero-shot video understanding. It combines coarse-grained hierarchical frame clustering with fine-grained bipartite token merging to reduce token counts while preserving semantic richness. By adaptively clustering key frames across temporal scales and selectively compressing tokens within frames, DyTo achieves strong zero-shot performance on both structured and open-ended VQA benchmarks, often surpassing fine-tuned and other training-free methods. The approach scales with model size, remains robust on longer video sequences, and is supported by ablations, visualizations, and qualitative case studies. Because it reduces the need for labeled data and fine-tuning, DyTo is practical to deploy on large video corpora.

Abstract

Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DyTo, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DyTo integrates hierarchical frame selection with a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DyTo, which outperforms both fine-tuned and training-free methods and sets a new state of the art for zero-shot video understanding.
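
The bipartite token merging named in the abstract can be illustrated with a minimal sketch. It assumes a ToMe-style bipartite soft matching: tokens are split into two alternating sets, each token in the first set is matched to its most similar token in the second by cosine similarity, and the r strongest matches are averaged together. The abstract does not spell out these details, so the function name `bipartite_merge`, the parameter `r`, and the alternating split are illustrative assumptions rather than DyTo's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bipartite_merge(tokens, r):
    """Shrink a token list by up to r tokens with one round of bipartite matching.

    tokens: list of equal-length lists of floats (hypothetical token features).
    r:      number of tokens to merge away (assumed hyperparameter).
    """
    a_idx = list(range(0, len(tokens), 2))  # set A: even positions
    b_idx = list(range(1, len(tokens), 2))  # set B: odd positions
    if not b_idx:
        return [list(t) for t in tokens]  # nothing to match against
    r = max(0, min(r, len(a_idx)))

    # Each A token proposes a single edge to its most similar B token.
    edges = []
    for i in a_idx:
        j = max(b_idx, key=lambda b: cosine(tokens[i], tokens[b]))
        edges.append((cosine(tokens[i], tokens[j]), i, j))

    # Keep only the r most similar edges; those A tokens are merged away.
    edges.sort(key=lambda e: e[0], reverse=True)
    merged_into = {i: j for _, i, j in edges[:r]}

    # Average every B token with the A tokens merged into it.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for i, j in merged_into.items():
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    # Emit surviving tokens in their original order.
    out = []
    for k, tok in enumerate(tokens):
        if k in merged_into:
            continue
        if k in sums:
            out.append([s / counts[k] for s in sums[k]])
        else:
            out.append(list(tok))
    return out
```

Calling `bipartite_merge(tokens, r)` on a frame with N tokens returns N - r tokens (fewer merges if r exceeds the size of the first set). Because matching runs only between the two sets, no token is merged more than once per round, which keeps the operation cheap compared with full pairwise clustering.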

Paper Structure

This paper contains 27 sections, 4 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Comparison with two SOTA training-free video-based LLM decoding methods over three different model backbones across five video benchmarks. DyTo and other baselines are marked using solid (—) and dashed (- - -) lines, respectively. DyTo outperforms existing training-free SOTA methods on almost all the benchmarks and achieves even better performance than most SFT-based methods.
  • Figure 2: Overview of DyTo, a training-free model built upon an image-based MLLM without any fine-tuning. Specifically, DyTo first divides the video into $K$ clusters using the [CLS] token (pink block). The dynamic bipartite merging module then samples frames from each cluster and constrains the final output length to $Z$, resulting in a better balance between computational efficiency and semantic richness.
  • Figure 3: Top: Performance comparison of baseline method under various video lengths. Bottom: Effect of different input sampling lengths under various video lengths.
  • Figure 4: Visualization of the sampling method and the clustering module output on a video. Our method selects frames that represent the video more comprehensively than other methods.
  • Figure 5: Clustering module output example from videos. Colors indicate different events in temporal order. The differences are clearly visible in the video clips.
  • ...and 4 more figures
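
The two stages described in the Figure 2 caption (grouping frames into $K$ clusters from [CLS] features, then holding the output to $Z$ tokens) can be sketched as follows. This is only an illustration under stated assumptions: the greedy merging of adjacent frames in `cluster_frames` is a stand-in for the paper's hierarchical clustering, which the caption does not specify, and `per_frame_budget` assumes the $Z$ tokens are split evenly across the sampled frames. Both function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_frames(cls_feats, k):
    """Group frames into k temporally contiguous clusters.

    cls_feats: one [CLS] feature vector per frame, in temporal order.
    Assumption: repeatedly fuse the most similar pair of adjacent clusters.
    """
    clusters = [[i] for i in range(len(cls_feats))]
    centroids = [list(f) for f in cls_feats]
    while len(clusters) > max(1, k):
        # Index of the adjacent pair whose centroids are most similar.
        best = max(
            range(len(clusters) - 1),
            key=lambda i: cosine(centroids[i], centroids[i + 1]),
        )
        n_l, n_r = len(clusters[best]), len(clusters[best + 1])
        # Size-weighted mean keeps the centroid equal to the cluster average.
        centroids[best] = [
            (a * n_l + b * n_r) / (n_l + n_r)
            for a, b in zip(centroids[best], centroids[best + 1])
        ]
        clusters[best] = clusters[best] + clusters[best + 1]
        del clusters[best + 1]
        del centroids[best + 1]
    return clusters


def per_frame_budget(z, num_frames):
    """Tokens each sampled frame may keep so the total stays within z (even split assumed)."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return max(1, z // num_frames)
```

With frame indices grouped this way, one or more frames can be sampled from each cluster and each sampled frame compressed down to `per_frame_budget(Z, num_frames)` tokens, so that videos with more distinct events spend fewer tokens per frame while the total stays fixed.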