
A Survey on Backbones for Deep Video Action Recognition

Zixuan Tang, Youjun Zhao, Yuhang Wen, Mengyuan Liu

TL;DR

This survey examines backbones for video action recognition, detailing three core families: two-stream networks, 3D CNNs, and Transformer-based models, with emphasis on capturing spatial and temporal cues for video understanding. It covers representative architectures and design strategies, including I3D-style inflated convolutions, SlowFast and non-local enhancements, ViViT- and Video Swin Transformer-style models, and multiscale/multiview variants. The analysis finds that Transformer-based backbones generally outperform CNN-based approaches on key benchmarks, but that real-time, resource-constrained deployment requires efficient designs such as the lightweight Side4Video. The authors highlight future directions such as efficient multiscale transformers, hybrid CNN-Transformer models, and data-efficient pretraining to meet the requirements of metaverse-scale video understanding.

Abstract

Action recognition is a key technology for building interactive metaverses. With the rapid development of deep learning, action recognition methods have advanced considerably. Researchers design backbones from multiple standpoints, which has led to a diversity of methods and to new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) two-stream networks and their variants, which, in this paper specifically, take RGB video frames and optical flow as input; 2) 3D convolutional networks, which exploit the RGB modality directly, so that separately extracting motion information is no longer necessary; 3) Transformer-based methods, which bring models from natural language processing into computer vision and video understanding. We offer objective insights in this review and hope to provide a reference for future research.
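
To make the first family concrete, the following is a minimal sketch of score-level (late) fusion in a two-stream setup, assuming each stream has already produced per-class logits for a clip. The function names (`softmax`, `late_fusion`, `predict`) and the `flow_weight` parameter are hypothetical and chosen for illustration; they are not taken from the survey, and the methods it covers may fuse the streams differently.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of the class probabilities from the two streams.

    rgb_logits:  scores from the spatial stream (RGB frames).
    flow_logits: scores from the temporal stream (optical flow).
    flow_weight: hypothetical mixing weight in [0, 1]; 0.5 is a plain average.
    """
    if len(rgb_logits) != len(flow_logits):
        raise ValueError("both streams must score the same set of classes")
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [
        (1.0 - flow_weight) * r + flow_weight * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


def predict(rgb_logits, flow_logits, flow_weight=0.5):
    """Return the index of the class with the highest fused probability."""
    fused = late_fusion(rgb_logits, flow_logits, flow_weight)
    return max(range(len(fused)), key=fused.__getitem__)
```

For example, `predict([2.0, 0.0, 0.0], [0.0, 3.0, 0.0])` lets the more confident flow stream decide the class, while setting `flow_weight=0.0` falls back to the RGB stream alone.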

Paper Structure (15 sections, 2 figures, 1 table)

This paper contains 15 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Over the last decade, many video action recognition datasets with various labels have been proposed, contributing to the advancement of action recognition tasks.
  • Figure 2: We review deep neural network backbones in video action recognition. The figure illustrates the general architecture of (a) two-stream networks and (b) 3D CNNs. Moreover, we review two ways of improving Transformers for action recognition: (c) designing different kinds of attention mechanisms, or (d) introducing multi-scale/multi-view features into the model.