A Survey on Backbones for Deep Video Action Recognition

Zixuan Tang; Youjun Zhao; Yuhang Wen; Mengyuan Liu

TL;DR

This survey examines backbones for video action recognition, covering three core families: two-stream networks, 3D CNNs, and Transformer-based models, with emphasis on how each captures spatial and temporal cues for video understanding. It covers representative architectures and design strategies, including I3D-style inflated convolutions, SlowFast and non-local enhancements, Transformers in the style of ViViT and Video Swin Transformer, and multiscale/multiview variants. The analysis finds that Transformer-based backbones generally outperform CNN-based approaches on key benchmarks, but that real-time, resource-constrained deployment requires efficient designs such as the lightweight Side4Video. The authors highlight future directions such as efficient multiscale Transformers, hybrid CNN-Transformer models, and data-efficient pretraining to meet metaverse-scale video understanding requirements.
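As an illustration of what "I3D-style inflated convolutions" refers to, the sketch below inflates a single 2D kernel into a 3D one by repeating it along the time axis and dividing by the temporal length, so that a clip of identical frames produces the same response as the original 2D filter on one frame. It is a minimal pure-Python sketch for a single input/output channel, not the survey's or I3D's actual implementation; the function names (`inflate_kernel`, `response_2d`, `response_3d`) and the example values are assumptions made for illustration.

```python
def inflate_kernel(kernel_2d, t):
    """Repeat a 2D kernel (kh x kw) t times along time and rescale by 1/t."""
    return [[[w / t for w in row] for row in kernel_2d] for _ in range(t)]


def response_2d(kernel_2d, patch):
    """Response of a 2D kernel on one image patch of the same size."""
    return sum(
        w * x
        for k_row, p_row in zip(kernel_2d, patch)
        for w, x in zip(k_row, p_row)
    )


def response_3d(kernel_3d, clip):
    """Response of a 3D kernel on a clip: one patch per time step."""
    return sum(response_2d(k, p) for k, p in zip(kernel_3d, clip))


# Hypothetical example: a Sobel-like 2D kernel and an arbitrary 3x3 patch.
kernel = [[1.0, 0.0, -1.0], [2.0, 0.0, -2.0], [1.0, 0.0, -1.0]]
patch = [[3.0, 1.0, 0.0], [4.0, 1.0, 5.0], [9.0, 2.0, 6.0]]

t = 4
static_clip = [patch] * t  # a clip made of t identical frames
kernel_3d = inflate_kernel(kernel, t)
```

On the static clip, `response_3d(kernel_3d, static_clip)` matches `response_2d(kernel, patch)`, which is the property that lets a 3D network start from weights pretrained on images.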

Abstract

Action recognition is a key technology in building interactive metaverses. With the rapid development of deep learning, action recognition methods have advanced considerably. Researchers design and implement backbones from multiple standpoints, which has led to a diversity of methods and to new challenges. This paper reviews several action recognition methods based on deep neural networks. We introduce these methods in three parts: 1) two-stream networks and their variants, which, in this paper, specifically use RGB video frames and optical flow as input; 2) 3D convolutional networks, which exploit the RGB modality directly, so that separately extracting motion information is no longer necessary; 3) Transformer-based methods, which bring a model from natural language processing into computer vision and video understanding. We offer objective insights in this review and hope to provide a reference for future research.
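To make the two-stream idea concrete, the sketch below combines class scores from an RGB (spatial) stream and an optical-flow (temporal) stream by averaging their softmax probabilities. Late fusion by weighted averaging is one common choice and is assumed here for illustration; the abstract does not specify a fusion scheme, and the function names, the `flow_weight` parameter, and the example logits are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(rgb_logits, flow_logits, flow_weight=0.5):
    """Weighted average of per-stream class probabilities (assumed fusion rule)."""
    rgb_probs = softmax(rgb_logits)
    flow_probs = softmax(flow_logits)
    return [
        (1.0 - flow_weight) * r + flow_weight * f
        for r, f in zip(rgb_probs, flow_probs)
    ]


def predict(rgb_logits, flow_logits, flow_weight=0.5):
    """Index of the class with the highest fused probability."""
    fused = late_fusion(rgb_logits, flow_logits, flow_weight)
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical per-stream outputs for a three-class problem.
rgb_logits = [2.0, 1.0, 0.1]   # appearance favours class 0
flow_logits = [0.5, 2.5, 0.3]  # motion favours class 1
```

With equal weights, the confident motion stream outweighs the appearance stream in this example; setting `flow_weight=0.0` falls back to the RGB-only prediction.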

Paper Structure (15 sections, 2 figures, 1 table)

This paper contains 15 sections, 2 figures, 1 table.

Introduction
Backbones of Action Recognition
Two-Streams Networks in Action Recognition
Two-Stream Networks
Multi-Stream Networks
3D CNNs
Inspiration from Image Domain
Spatiotemporal Semantic Information
Transformer-based Neural Network
Transformer-based Architectures and Spatiotemporal Attention Design
Multiscale and Multiview Transformers
Integration of Transformer and CNN
Comparison
Conclusion
Acknowledgement

Figures (2)

Figure 1: Over the last decade, many video action recognition datasets with various labels have been proposed, contributing to the advancement of action recognition tasks.
Figure 2: We review deep neural network backbones in video action recognition, showing the general architecture of (a) two-stream networks and (b) 3D CNNs. We also review two ways of improving Transformers for action recognition: (c) designing different kinds of attention mechanisms, or (d) introducing multi-scale/multi-view features into the model.
