Motion Guided Token Compression for Efficient Masked Video Modeling

Yukun Feng; Yangming Shi; Fengze Liu; Tan Yan

TL;DR

The paper addresses the quadratic attention cost of Transformer-based video models. This cost becomes a bottleneck when the frame rate (FPS) is raised to capture richer motion, because higher FPS adds both redundancy and computation. It introduces Motion Guided Token Compression (MGTC), a lightweight, motion-informed masking strategy that keeps informative patches across time and masks low-change blocks, using temporal patch differences and a per-video threshold. MGTC can be applied during inference or training. Raising FPS alone improves top-1 accuracy on Kinetics-400, UCF101, and HMDB51 by over 1.6, 1.6, and 4.0 respectively; adding MGTC at a 25% masking ratio improves Kinetics-400 accuracy by a further 0.1 while cutting computational cost by over 31%. The approach draws on ideas from video compression and masked vision modeling, and ablations show MGTC outperforming other masking strategies across various settings. The results suggest MGTC is a practical way to exploit higher FPS for better video understanding under a fixed computational budget.
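To make the scaling argument concrete, the following is a rough back-of-the-envelope sketch rather than the paper's FLOPs accounting. It assumes a hypothetical ViT-style tokenizer (224x224 frames, 16x16 patches, tubelets of 2 frames); the function names and defaults are illustrative and not taken from the paper. It counts only the pairwise attention term, which grows with the square of the token count.

```python
def token_count(frames, height=224, width=224, patch=16, tubelet=2):
    """Tokens per clip under an assumed (hypothetical) tubelet embedding."""
    return (frames // tubelet) * (height // patch) * (width // patch)


def relative_attention_cost(frames, base_frames, keep_ratio=1.0):
    """Pairwise attention interactions relative to an unmasked base clip.

    keep_ratio is the fraction of tokens left after masking
    (for example 0.75 for a 25% masking ratio).
    """
    n = token_count(frames) * keep_ratio
    n_base = token_count(base_frames)
    return (n / n_base) ** 2


# Doubling the number of sampled frames doubles the tokens,
# so the pairwise attention term grows fourfold.
print(relative_attention_cost(32, 16))        # 4.0

# Keeping 75% of the tokens scales the attention term by 0.75 ** 2.
print(relative_attention_cost(16, 16, 0.75))  # 0.5625
```

Other parts of a Transformer block, such as the MLP and the linear projections, scale linearly with the token count, so the overall saving from masking would generally be smaller than this quadratic term alone suggests.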

Abstract

Recent developments in Transformers have made notable strides in video comprehension. Nonetheless, the O($N^2$) computational complexity of attention mechanisms presents substantial hurdles when dealing with the high dimensionality of videos. This challenge becomes particularly pronounced when increasing the frames per second (FPS) to improve motion capture, since doing so is likely to introduce redundancy and exacerbate the existing computational limitations. In this paper, we first show the performance gains achieved by increasing the FPS rate. We then present a novel approach, Motion Guided Token Compression (MGTC), that enables Transformer models to use a smaller yet more representative set of tokens for comprehensive video representation. This yields substantial reductions in computational burden and adapts readily to increased FPS rates. Specifically, we draw inspiration from video compression algorithms and examine the variance between patches in consecutive video frames along the temporal dimension. Tokens whose disparity falls below a predetermined threshold are then masked. This masking strategy addresses video redundancy while conserving essential information. Our experiments on three widely used video recognition datasets, Kinetics-400, UCF101 and HMDB51, demonstrate that raising the FPS rate improves top-1 accuracy by over 1.6, 1.6 and 4.0, respectively. By applying MGTC with a masking ratio of 25\%, we further improve accuracy by 0.1 and simultaneously reduce computational costs by over 31\% on Kinetics-400. Even within a fixed computational budget, higher FPS rates paired with MGTC sustain performance gains over lower FPS settings.
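The masking idea described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it works on a grayscale clip, scores each patch by its mean absolute difference from the same patch in the previous frame, and takes the threshold as a plain argument. The function name `motion_guided_mask`, the scoring rule, and the threshold value are all hypothetical, and the paper's sub-block division and block masking steps are not reproduced.

```python
import numpy as np


def motion_guided_mask(video, patch=16, threshold=0.1):
    """Return a boolean array (True = masked) for frames 1..T-1.

    video: float array of shape (T, H, W); grayscale for simplicity.
    Assumption: a patch is scored by the mean absolute difference
    from the co-located patch in the previous frame.
    """
    t, h, w = video.shape
    gh, gw = h // patch, w // patch
    # Absolute difference between consecutive frames: shape (T-1, H, W).
    diff = np.abs(video[1:] - video[:-1])
    # Crop to a whole number of patches, then average inside each patch.
    diff = diff[:, : gh * patch, : gw * patch]
    scores = diff.reshape(t - 1, gh, patch, gw, patch).mean(axis=(2, 4))
    # Low-change patches are treated as redundant and masked.
    return scores < threshold


# Synthetic clip: a bright square moves one patch to the right per frame
# over a static black background.
clip = np.zeros((4, 64, 64))
for i in range(4):
    clip[i, 16:32, 16 * i : 16 * i + 16] = 1.0

mask = motion_guided_mask(clip, patch=16, threshold=0.1)
print(mask.shape)   # (3, 4, 4)
print(mask.mean())  # 0.875
```

In this toy clip only the patches the square leaves or enters survive, while the static background is masked. A fixed masking ratio such as 25% could be approximated by choosing the threshold per clip from the score distribution, which is one possible reading of a per-video threshold.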

Paper Structure (24 sections, 1 equation, 5 figures, 1 table)

Introduction
Related Work
Video Action Recognition
Masked Vision Modeling
Video Compression
Methodology
Motion Guided Token Compression
Sub-block Division
Block Masking
Training and Evaluation
Evaluation with MGTC
Training with MGTC
Experiments
Datasets
Settings
...and 9 more sections

Figures (5)

Figure 1: Overview of top-1 accuracy under different FPS rates on Kinetics-400 and UCF101. Relative circle size indicates the relative amount of computation, which is controlled by the masking ratio and FPS.
Figure 2: Comparison of different masking methods under various FPS rates. MGTC captures the action movement and removes redundant information, especially at higher FPS rates. A masking ratio of 50% is used here.
Figure 3: Workflow of motion-guided masking. Patch differences are computed and then used for masking.
Figure 4: Pixel-residual distributions at 12 FPS.
