Selective Volume Mixup for Video Action Recognition

Yi Tan, Zhaofan Qiu, Yanbin Hao, Ting Yao, Tao Mei

TL;DR

This paper proposes a novel video augmentation strategy, Selective Volume Mixup (SV-Mix), to improve the generalization ability of deep models trained on a limited number of videos, and empirically demonstrates its merits on a wide range of video action recognition benchmarks.

Abstract

Recent advances in Convolutional Neural Networks (CNNs) and Vision Transformers have convincingly demonstrated high learning capability for video action recognition on large datasets. Nevertheless, deep models often overfit on small-scale datasets with a limited number of training videos. A common solution is to apply existing image augmentation strategies, including Mixup, Cutmix, and RandAugment, to each frame individually, although these strategies are not specifically optimized for video data. In this paper, we propose a novel video augmentation strategy named Selective Volume Mixup (SV-Mix) to improve the generalization ability of deep models with limited training videos. SV-Mix devises a learnable selective module to choose the most informative volumes from two videos and mixes these volumes to produce a new training video. Technically, we propose two new modules: a spatial selective module that selects local patches for each spatial position, and a temporal selective module that mixes entire frames for each timestamp and maintains the spatial pattern. At each training iteration, we randomly choose one of the two modules to expand the diversity of training samples. The selective modules are jointly optimized with the video action recognition framework to find the optimal augmentation strategy. We empirically demonstrate the merits of the SV-Mix augmentation on a wide range of video action recognition benchmarks and consistently boost the performance of both CNN-based and transformer-based models.
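The mixing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the learnable selective modules are replaced by random weights (a stand-in), the function names `upsample_mask`, `mix_videos`, and `sv_mix_sketch` are hypothetical, the `grid` of volumes is an assumed parameter, and setting the label proportion to the mean of the mask is an assumption made only for this sketch.

```python
import numpy as np

def upsample_mask(weights, T, H, W):
    """Nearest-neighbour upsample a (t, h, w) volume-level mask to (T, H, W)."""
    t, h, w = weights.shape
    assert T % t == 0 and H % h == 0 and W % w == 0, "grid must divide the video size"
    mask = np.repeat(weights, T // t, axis=0)
    mask = np.repeat(mask, H // h, axis=1)
    return np.repeat(mask, W // w, axis=2)

def mix_videos(video_a, video_b, mask):
    """Blend two (T, H, W, C) videos with a (T, H, W) mask whose values lie in [0, 1]."""
    m = mask[..., None]  # broadcast the mask over the channel axis
    return m * video_a + (1.0 - m) * video_b

def sv_mix_sketch(video_a, video_b, label_a, label_b, rng, grid=(2, 2, 2)):
    """Randomly pick spatial (per-patch) or temporal (per-frame-group) mixing.

    ASSUMPTION: random weights stand in for the output of the learnable
    selective modules, which this sketch does not implement.
    """
    T, H, W, _ = video_a.shape
    t, h, w = grid
    if rng.random() < 0.5:
        mode = "spatial"   # one weight per local patch
        weights = rng.random((t, h, w))
    else:
        mode = "temporal"  # one weight per timestamp, shared by all spatial positions
        weights = np.broadcast_to(rng.random((t, 1, 1)), (t, h, w))
    mask = upsample_mask(weights, T, H, W)
    mixed = mix_videos(video_a, video_b, mask)
    lam = float(mask.mean())  # ASSUMPTION: label proportion taken as the mean mask value
    label = lam * label_a + (1.0 - lam) * label_b
    return mixed, label, mode, mask
```

In the temporal branch every spatial position in a frame shares one weight, which is what keeps the spatial pattern of each frame intact; the spatial branch instead varies the weight from patch to patch.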

Paper Structure

This paper contains 13 sections, 11 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: The intuition of (a) the typical Cutmix (Yun et al., 2019) and Mixup (Zhang et al.) augmentations on video data, and (b) our Selective Volume Mixup (SV-Mix). The typical methods randomly combine regions or entire frames from two videos and may lose crucial information. In contrast, our SV-Mix contains learnable selective modules that adaptively select valuable volumes. A performance comparison between Cutmix/Mixup and our SV-Mix is also shown in (c).
  • Figure 2: An overview of our Selective Volume Mixup (SV-Mix) data augmentation. Given two training videos, we first extract a volume-level feature map in $\mathbb{R}^{T'\times H'\times W'\times C}$ for each video with an encoder $f_\phi^t(\cdot)$. Next, we randomly instantiate a specific volume partition function $Vol(\cdot)$ to partition the feature map as $V^{spa}$ or $V^{tem}$, which correspond to patch-level selection (spatial selective module) and frame-level selection (temporal selective module), respectively. The selected volumes are then combined into an augmented video, which serves as the input to the subsequent action recognition framework.
  • Figure 3: A diagram of the volume selection module in our SV-Mix. Given the volumes from two videos $v_i,v_j\in \mathbb{R}^{N\times C}$ where $N$ and $C$ are the number of volumes and channels, respectively, the attention weights across two videos are calculated through three trainable linear mappings $W_q,W_k,W_v$. Here $\lambda$ denotes the label proportion of the first video.
  • Figure 4: The proposed disentangled training pipeline for jointly optimizing SV-Mix and the action recognition network $f_\phi^s$. In this pipeline, the gradient of SV-Mix is provided by a momentum-updated version of the action recognition network, $f_\phi^t$. The gradients of SV-Mix and $f_\phi^s$ are therefore disentangled, which stabilizes the training process.
  • Figure 5: Instance visualization of mixing two videos labeled as "Pretending to close sth" (Video A) and "Tearing sth into two pieces" (Video B). We compare the mixed videos generated by the spatial selective module and the temporal selective module under disentangled training and entangled training to verify the importance of training disentanglement. Under entangled training, both the spatial and temporal selective modules fail to capture the informative spatial/temporal volumes and mix the video samples in a uniform manner.
  • ...and 4 more figures
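
The momentum-updated network $f_\phi^t$ mentioned in the Figure 4 caption could be maintained along the following lines. This is a sketch under assumptions: the caption does not give the update rule, so the common exponential-moving-average form is assumed here, and the function name `momentum_update`, the dictionary-of-floats parameter layout, and the default momentum of 0.999 are all hypothetical.

```python
def momentum_update(teacher, student, m=0.999):
    """In-place EMA step: teacher <- m * teacher + (1 - m) * student.

    ASSUMPTION: `teacher` plays the role of f_phi^t and `student` of f_phi^s;
    both are plain dicts mapping parameter names to floats for illustration.
    """
    for name, s_val in student.items():
        teacher[name] = m * teacher[name] + (1.0 - m) * s_val
    return teacher
```

Because the teacher's parameters change only through this averaging step and never through its own backward pass, gradients flowing to the mixing modules through the teacher stay separate from the gradients that update the student, which is the disentanglement the caption describes.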