Dynamic and Compressive Adaptation of Transformers From Images to Videos

Guozhen Zhang, Jingyu Liu, Shengming Cao, Xiaotong Zhao, Kevin Zhao, Kai Ma, Limin Wang

TL;DR

This paper addresses the high computational cost of adapting image-pretrained Vision Transformers to video by proposing InTI, a compressive adaptation method that dynamically interpolates tokens across neighboring frames. InTI introduces a compression function between Transformer blocks and a multi-scale weight prediction network to perform point-wise, inter-frame token aggregation, effectively halving the number of frames processed per step while preserving spatiotemporal coherence. The method achieves substantial GFLOP reductions (about 37%) with competitive or improved Top-1 accuracy on Kinetics-400 (e.g., 87.1 with ViT-L, and 87.6 when combined with temporal modules), and shows strong transferability to SSv2 and UCF101/HMDB51. Ablation studies demonstrate the value of multi-scale contextual information and the superiority of SoftMax-based weight prediction, highlighting InTI’s potential as a flexible, efficient component for image-to-video transformer adaptation.

Abstract

Recently, the remarkable success of Vision Transformers (ViTs) pre-trained with image-text matching has sparked interest in image-to-video adaptation. However, most current approaches retain the full forward pass for each frame, leading to high computational overhead when processing entire videos. In this paper, we present InTI, a novel approach for compressive image-to-video adaptation using dynamic Inter-frame Token Interpolation. InTI aims to softly preserve the informative tokens without disrupting their coherent spatiotemporal structure. Specifically, each pair of tokens at identical positions in neighboring frames is linearly aggregated into a new token, where the aggregation weights are generated by a multi-scale context-aware network. In this way, the information in neighboring frames is adaptively compressed point by point, halving the number of processed frames at each compression step. Importantly, InTI can be seamlessly integrated with existing adaptation methods, achieving strong performance without overly complex design. On Kinetics-400, InTI reaches a top-1 accuracy of 87.1 with a 37.5% reduction in GFLOPs compared to naive adaptation. When combined with additional temporal modules, InTI achieves a top-1 accuracy of 87.6 with a 37% reduction in GFLOPs. Similar conclusions hold on other common datasets.
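The point-wise aggregation described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the learned multi-scale weight network is replaced by a hypothetical stand-in (`toy_score`, which scores a token by its L2 norm), and the names `softmax_pair` and `interpolate_frames` are assumptions made for illustration. The sketch only shows the structure of the operation: tokens at the same position in two neighboring frames are combined with SoftMax-normalized weights, so T frames become T/2.

```python
import math


def softmax_pair(score_a, score_b):
    """Numerically stable softmax over two scalar scores."""
    m = max(score_a, score_b)
    ea, eb = math.exp(score_a - m), math.exp(score_b - m)
    return ea / (ea + eb), eb / (ea + eb)


def toy_score(token):
    # Hypothetical stand-in for the learned multi-scale weight network:
    # here a token is simply scored by its L2 norm.
    return math.sqrt(sum(x * x for x in token))


def interpolate_frames(frames, score_fn=toy_score):
    """Merge neighboring frames pairwise.

    frames: list of T frames; each frame is a list of N tokens;
            each token is a list of D floats. T must be even.
    Returns T/2 frames with the same N and D.
    """
    if len(frames) % 2 != 0:
        raise ValueError("expected an even number of frames")
    merged_frames = []
    for t in range(0, len(frames), 2):
        frame_a, frame_b = frames[t], frames[t + 1]
        merged = []
        # Tokens are paired by spatial position, so the token grid is preserved.
        for tok_a, tok_b in zip(frame_a, frame_b):
            w_a, w_b = softmax_pair(score_fn(tok_a), score_fn(tok_b))
            merged.append([w_a * a + w_b * b for a, b in zip(tok_a, tok_b)])
        merged_frames.append(merged)
    return merged_frames


# Four frames, one token each, two channels per token.
frames = [[[1.0, 0.0]], [[1.0, 0.0]], [[0.0, 2.0]], [[0.0, 2.0]]]
compressed = interpolate_frames(frames)
print(len(compressed))  # 2 frames remain after one compression step
```

Because the two weights at each position sum to one, every merged token is a convex combination of its two source tokens. In the method itself the weights come from a trained network with multi-scale context rather than from a fixed norm-based score.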

Paper Structure (17 sections, 11 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of InTI and its performance on K400. (a) InTI softly aggregates tokens from neighboring frames with dynamically generated weights. (b) With InTI, we achieve 87.1% top-1 accuracy with a 37.5% reduction in GFLOPs. InTI can also be combined with any adaptation method, such as UniFormerV2 (li2023uniformerv2), for better performance.
  • Figure 2: Illustration of InTI. InTI dynamically aggregates tokens from neighbor frames by a learnable prediction network $\theta$.
  • Figure 3: Multi-Scale Information Extraction. We design four different lightweight networks that capture contextual information at multiple scales to enhance the spatiotemporal perception of weight prediction.
  • Figure 4: Visualization of predicted weights and the aggregated frame. As an example, four frames are compressed twice into a single frame, and the predicted weight for each frame is obtained by multiplying the weights predicted at each successive compression.
  • Figure 5: Additional visualization of predicted weights and the aggregated frame. Each group of four frames is compressed twice into a single frame, and the predicted weight for each frame is obtained by multiplying the weights predicted at each successive compression.
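
The cascaded weights described in the captions of Figures 4 and 5 can be worked through with a small sketch. It assumes that, at a given token position, each compression step yields two weights that sum to one; the function name `effective_weights` and the example numbers are hypothetical and are not taken from the paper.

```python
def effective_weights(level1, level2):
    """Per-frame contribution after two cascaded compressions.

    level1: ((w0, w1), (w2, w3)), the weights used when merging
            frames (0, 1) and frames (2, 3) at one token position.
    level2: (u0, u1), the weights used when merging the two
            intermediate frames at the same position.
    Returns the effective weight of each of the four original frames.
    """
    (w0, w1), (w2, w3) = level1
    u0, u1 = level2
    # Each original frame's weight is the product of the weights
    # along its path to the final aggregated frame.
    return [u0 * w0, u0 * w1, u1 * w2, u1 * w3]


# Illustrative values only.
weights = effective_weights(((0.75, 0.25), (0.5, 0.5)), (0.5, 0.5))
print(weights)       # [0.375, 0.125, 0.25, 0.25]
print(sum(weights))  # 1.0
```

Since the weights at every step sum to one, the four effective weights also sum to one, so they can be read as each original frame's share of the final aggregated token.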