AM Flow: Adapters for Temporal Processing in Action Recognition

Tanay Agrawal, Abid Ali, Antitza Dantcheva, Francois Bremond

TL;DR

This work proposes two methods to compute AM flow, depending on camera motion, and endows an image model with the ability to achieve state-of-the-art results on popular action recognition datasets while reducing the number of epochs needed for training.

Abstract

Deep learning models, in particular image models, have recently gained generalisability and robustness. In this work, we propose to exploit such advances in the realm of video classification. Video foundation models suffer from the requirement of extensive pretraining and a large training time. Towards mitigating such limitations, we propose "Attention Map (AM) Flow" for image models, a method for identifying pixels relevant to motion in each input video frame. In this context, we propose two methods to compute AM flow, depending on camera motion. AM flow allows the separation of spatial and temporal processing, while providing improved results over combined spatio-temporal processing (as in video models). Adapters, one of the popular techniques in parameter-efficient transfer learning, facilitate the incorporation of AM flow into pretrained image models, mitigating the need for full fine-tuning. We extend adapters to "temporal processing adapters" by incorporating a temporal processing unit into the adapters. Our work achieves faster convergence, therefore reducing the number of epochs needed for training. Moreover, we endow an image model with the ability to achieve state-of-the-art results on popular action recognition datasets. This reduces training time and simplifies pretraining. We present experiments on Kinetics-400, Something-Something v2, and Toyota Smarthome datasets, showcasing state-of-the-art or comparable results.
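
As a rough illustration of how an adapter attaches to a frozen image model, the sketch below shows a generic bottleneck adapter in the serial and parallel placements named in Figure 3. Everything in it is an assumption made for illustration: the helper names (`linear`, `adapter`, `serial_block`, `parallel_block`), the ReLU bottleneck, and the toy weights are not taken from the paper, and the temporal processing unit is left out because its design is not described here.

```python
# Hypothetical sketch of a generic bottleneck adapter; not the paper's implementation.

def linear(x, weight):
    """Apply a weight matrix (a list of output rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def adapter(x, w_down, w_up):
    """Down-project, apply ReLU, then up-project back to the input width."""
    hidden = [max(0.0, h) for h in linear(x, w_down)]
    return linear(hidden, w_up)


def serial_block(x, frozen_layer, w_down, w_up):
    """Serial placement: the adapter refines the frozen layer's output."""
    h = frozen_layer(x)
    return [a + b for a, b in zip(h, adapter(h, w_down, w_up))]


def parallel_block(x, frozen_layer, w_down, w_up):
    """Parallel placement: the adapter reads the same input as the frozen layer."""
    h = frozen_layer(x)
    return [a + b for a, b in zip(h, adapter(x, w_down, w_up))]


# Toy stand-in for a frozen pretrained layer (assumption: it just doubles its input).
def frozen_layer(v):
    return [2.0 * a for a in v]


x = [1.0, -2.0, 3.0]
w_down = [[0.5, 0.5, 0.5]]        # 3 features -> 1 bottleneck unit
w_up = [[1.0], [0.0], [0.0]]      # 1 bottleneck unit -> 3 features

print(serial_block(x, frozen_layer, w_down, w_up))
print(parallel_block(x, frozen_layer, w_down, w_up))
```

Only `w_down` and `w_up` would be trained in such a setup, while `frozen_layer` keeps its pretrained weights. With the up-projection set to zero, both placements return the frozen layer's output unchanged, which is a common way to start adapter training from the pretrained model's behaviour.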

Paper Structure

This paper contains 19 sections, 6 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Intuition of AM flow. Regions with no motion in the video frames, and entries with minimal change in the attention map, are shown as white/transparent. Rows with changes in the attention map correspond to input patches with motion; for example, the second and third patches have motion, so the second and third rows of the attention map change. Note: the attention map is hand-crafted for explanation only, and the colours have no intended meaning.
  • Figure 2: The middle part of the figure shows the frozen image model (ViT, in red) with trainable additions: temporal processing adapters (in green) containing temporal processing units (in purple). The adapter across MHSA takes AM flow as input along with the input to the transformer block. On the left (in yellow), $X_{A_t}$ is shown as computed inside MHSA. On the right (in yellow), $X_{A_t}$ and $X_{A_{t+1}}$ are used to compute AM flow. $t$ and $t+1$ denote the time-steps of the input frames. The global temporal processing unit (TPU) and the classification head (squeeze and linear) added to it are shown in violet. All logits received from the temporal processing module (TPM) and the frozen model branch are averaged to obtain the final classification logits.
  • Figure 3: (a) Serial Adapters. (b) Parallel Adapters.
  • Figure 4: How AM flow ($X_A$) is computed when there is camera movement or motion in the background.
  • Figure 5: Computed AM flow for two frames (on top) from Smarthome. Starting from the top-left and going row-wise, AM flow is visualised for each transformer block in ViT-B, from the first to the last. The figure shows that AM flow need not be added to every layer; in this example, only the first and last layers are important.
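
The intuition in Figure 1 can be sketched in a few lines. This is a minimal illustration, assuming AM flow is taken as the element-wise difference between the attention maps of consecutive frames ($X_{A_t}$ and $X_{A_{t+1}}$); the exact formulation, and the variant for camera motion shown in Figure 4, are not given in this section. The function names, the threshold, and the hand-crafted 4x4 maps are hypothetical.

```python
# Hypothetical sketch of the Figure 1 intuition; not the paper's implementation.

def am_flow(attn_t, attn_t1):
    """Element-wise change between the attention maps of frames t and t+1."""
    return [
        [b - a for a, b in zip(row_t, row_t1)]
        for row_t, row_t1 in zip(attn_t, attn_t1)
    ]


def patches_with_motion(flow, threshold=1e-6):
    """Indices of rows (query patches) whose attention changed beyond the threshold."""
    return [i for i, row in enumerate(flow) if max(abs(v) for v in row) > threshold]


# Hand-crafted attention maps for four patches (each row sums to 1).
attn_t = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.7, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
attn_t1 = [
    [0.7, 0.1, 0.1, 0.1],  # first patch: no change
    [0.1, 0.4, 0.4, 0.1],  # second patch: attention shifts
    [0.1, 0.4, 0.4, 0.1],  # third patch: attention shifts
    [0.1, 0.1, 0.1, 0.7],  # fourth patch: no change
]

flow = am_flow(attn_t, attn_t1)
print(patches_with_motion(flow))  # [1, 2] (zero-based: the second and third patches)
```

Rows that stay at zero correspond to static patches, while the non-zero rows pick out the second and third patches, matching the example in the Figure 1 caption.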