
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video

Xinhao Li, Yuhan Zhu, Limin Wang

TL;DR

ZeroI2V addresses the challenge of transferring image transformers to video tasks without increasing inference cost. It introduces Spatial-Temporal Dual-Headed Attention (STDHA) to provide zero-cost temporal modeling by allocating a subset of attention heads to attend across frames with time offsets $\Delta t_i$, while the remaining heads handle spatial relations. It also employs densely placed linear adapters that are merged with the frozen backbone via structural reparameterization, enabling zero-cost inference after training. Across benchmarks like Kinetics-400 and Something-Something V2, ZeroI2V matches or surpasses state-of-the-art methods with substantially lower parameter counts and computation, including strong few-shot performance. The approach offers a practical, scalable backbone for efficient video understanding using existing image pre-trained transformers.
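
The head-splitting idea can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `stdha`, the `(T, H, N, D)` tensor layout, and the clamping of out-of-range frame indices at clip boundaries are all assumptions made for illustration. A head with offset 0 acts as a spatial head; a head with a nonzero offset reads its keys and values from the frame shifted by that offset, which adds no parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def stdha(q, k, v, offsets):
    """Hypothetical sketch of spatial-temporal dual-headed attention.

    q, k, v: arrays of shape (T, H, N, D) = (frames, heads, tokens, head dim).
    offsets: one integer frame offset per head; 0 marks a spatial head,
             a nonzero value marks a temporal head.
    """
    T, H, N, D = q.shape
    out = np.empty_like(q)
    for h, dt in enumerate(offsets):
        # Frame that each query frame reads keys/values from.
        # Clamping at the clip boundaries is an assumption of this sketch.
        src = np.clip(np.arange(T) + dt, 0, T - 1)
        k_h = k[src, h]                      # (T, N, D)
        v_h = v[src, h]                      # (T, N, D)
        scores = q[:, h] @ k_h.transpose(0, 2, 1) / np.sqrt(D)  # (T, N, N)
        out[:, h] = softmax(scores) @ v_h    # (T, N, D)
    return out
```

Calling `stdha(q, k, v, offsets=[0, 1, -1])` on three-head inputs would make the first head purely spatial, while the second and third heads attend to the next and previous frame respectively. The attention matrices keep the same size as in per-frame attention, so the arithmetic cost of the attention itself is unchanged in this sketch.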

Abstract

Adapting image models to the video domain has emerged as an efficient paradigm for solving video recognition tasks. Due to the huge number of parameters and effective transferability of image models, performing full fine-tuning is less efficient and even unnecessary. Thus, recent research is shifting its focus toward parameter-efficient image-to-video adaptation. However, these adaptation strategies inevitably introduce extra computational costs to deal with the domain gap and temporal modeling in videos. In this paper, we present a new adaptation paradigm (ZeroI2V) that transfers image transformers to video recognition tasks while introducing zero extra cost to the original models during inference. To achieve this goal, we present two core designs. First, to capture the dynamics in videos and reduce the difficulty of image-to-video adaptation, we exploit the flexibility of self-attention and introduce spatial-temporal dual-headed attention (STDHA). This approach efficiently endows the image transformers with temporal modeling capability at zero extra parameters and computation. Second, to handle the domain gap between images and videos, we propose a linear adaptation strategy that utilizes lightweight, densely placed linear adapters to fully transfer the frozen image models to video recognition. Thanks to the customized linear design, all newly added adapters can be easily merged with the original modules through structural reparameterization after training, enabling zero extra cost during inference. Extensive experiments on representative fully-supervised and few-shot video recognition benchmarks show that ZeroI2V can match or even outperform previous state-of-the-art methods while enjoying superior parameter and inference efficiency.
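
The reason a purely linear adapter can be merged away is that a composition of linear maps is itself a linear map. The NumPy sketch below illustrates this under stated assumptions: the adapter form (a low-rank residual map `x + x @ down @ up + c` placed in front of a frozen layer `x @ W + b`) and the name `merge_adapter` are hypothetical and chosen for illustration; the abstract states only that the adapters are linear and are merged via structural reparameterization.

```python
import numpy as np

def merge_adapter(W, b, down, up, c):
    """Fold a linear adapter into the frozen layer that follows it.

    Training-time computation (assumed form, row-vector convention):
        h = x + x @ down @ up + c      # linear adapter with residual path
        y = h @ W + b                  # frozen pre-trained layer
    Expanding gives y = x @ (I + down @ up) @ W + (c @ W + b),
    so a single merged layer reproduces the same output.
    """
    A = np.eye(W.shape[0]) + down @ up   # (d_in, d_in)
    W_merged = A @ W                     # (d_in, d_out)
    b_merged = c @ W + b                 # (d_out,)
    return W_merged, b_merged

# Usage: verify that the merged layer matches the two-step computation.
rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 32, 4
W, b = rng.standard_normal((d_in, d_out)), rng.standard_normal(d_out)
down = rng.standard_normal((d_in, rank))
up = rng.standard_normal((rank, d_in))
c = rng.standard_normal(d_in)
x = rng.standard_normal((10, d_in))

y_train = (x + x @ down @ up + c) @ W + b
W_merged, b_merged = merge_adapter(W, b, down, up, c)
y_infer = x @ W_merged + b_merged
```

In this sketch `y_infer` agrees with `y_train` up to floating-point error, and the merged layer has the same shape as the original frozen layer, so inference uses no additional parameters or operations. Inserting a nonlinearity inside the adapter would break this algebra, which is consistent with the abstract's emphasis on a linear design.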

Paper Structure (22 sections, 6 equations, 3 figures, 14 tables)


Figures (3)

  • Figure 1: Left: Our proposed image-to-video transfer learning method. Right: Comparison of PETL methods on the SSv2 validation set. For a more intuitive comparison, the views of the methods in the figure are all 8$\times$3$\times$1. Two core techniques enable us to achieve superior performance on video tasks without introducing additional computation and parameters during inference.
  • Figure 2: Illustration of the proposed linear adaptation and STDHA.
  • Figure 3: Visualization of attention maps of CLIP, spatial heads, temporal heads, and STDHA at the last layer, generated by Grad-CAM on the SSv2 validation set.