Video Token Merging for Long-form Video Understanding

Seon-Ho Lee, Jue Wang, Zhikang Zhang, David Fan, Xinyu Li

TL;DR

This paper proposes a learnable video token merging (VTM) algorithm that dynamically merges tokens based on their saliency, reducing memory costs by 84% and boosting throughput by approximately 6.89 times compared to baseline algorithms.

Abstract

As the scale of data and models for video understanding rapidly expands, handling long-form video input in transformer-based models presents a practical challenge. Rather than resorting to input sampling or token dropping, which may result in information loss, token merging shows promising results when combined with transformers. However, applying token merging to long-form video processing is not trivial. We begin with the premise that token merging should not rely solely on the similarity of video tokens; the saliency of tokens should also be considered. To address this, we explore various video token merging strategies for long-form video classification, starting with a simple extension of image token merging, moving to region-concentrated merging, and finally proposing a learnable video token merging (VTM) algorithm that dynamically merges tokens based on their saliency. Extensive experimental results show that we achieve better or comparable performance on the LVU, COIN, and Breakfast datasets. Moreover, our approach reduces memory costs by 84% and boosts throughput by approximately 6.89 times compared to baseline algorithms.
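
As a rough illustration of the idea in the abstract (pick target tokens by saliency, then merge each remaining token into its most similar target), the following minimal Python sketch may help. It is not the authors' implementation: the function names `cosine` and `merge_tokens`, the externally supplied `saliency` scores, and the mean-pooling merge rule are all assumptions made for illustration. In the learnable VTM described above, saliency would be estimated by the network rather than passed in.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_tokens(tokens, saliency, num_targets):
    """Merge len(tokens) tokens into num_targets tokens.

    tokens:   list of feature vectors (lists of floats).
    saliency: one score per token (hypothetical input; a learnable
              scheme would predict these scores inside the network).
    """
    # Rank tokens by saliency: the top ones become merge targets,
    # the remaining ones are sources to be absorbed.
    order = sorted(range(len(tokens)), key=lambda i: -saliency[i])
    targets, sources = order[:num_targets], order[num_targets:]

    # Each group starts with its target token.
    groups = [[tokens[t]] for t in targets]

    # Assign every source token to its most similar target.
    for s in sources:
        best = max(
            range(num_targets),
            key=lambda k: cosine(tokens[s], tokens[targets[k]]),
        )
        groups[best].append(tokens[s])

    # Replace each group by the mean of its members (assumed merge rule).
    return [[sum(col) / len(g) for col in zip(*g)] for g in groups]
```

Calling `merge_tokens(tokens, saliency, num_targets)` with four tokens and `num_targets=2` returns two merged tokens, so every later attention layer sees half as many tokens. This kind of reduction is where memory and throughput savings would come from.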

Paper Structure

This paper contains 21 sections, 9 equations, 7 figures, and 9 tables.

Figures (7)

  • Figure 1: Comparison of GPU memory footprint and throughput against scene prediction accuracy on the LVU dataset [wu2021lvu].
  • Figure 2: The architectures of (a) the baseline network, (b) the transformer block, and (c) the video token merging block.
  • Figure 3: Visualizations of target tokens of different VTM methods: (a) naïve VTM, (b) center-concentrated VTM, (c) motion-based VTM, and (d) learnable VTM. In (d), learnable VTM selects the target tokens around salient objects rather than backgrounds.
  • Figure 4: An overview of the learnable video token merging block. The auxiliary path is used during training only.
  • Figure 5: Visualizations of video token merging results on the LVU dataset. Patches with the same inner and border color are merged together. The tokens corresponding to the backgrounds are merged together, thereby increasing the influence of salient tokens in the attention process.
  • ...and 2 more figures