A Comprehensive Study of Deep Video Action Recognition

Yi Zhu, Xinyu Li, Chunhui Liu, Mohammadreza Zolfaghari, Yuanjun Xiong, Chongruo Wu, Zhi Zhang, Joseph Tighe, R. Manmatha, Mu Li

TL;DR

This survey documents the evolution of deep learning for video action recognition, from early attempts at adapting deep learning through two-stream networks and 3D CNNs to recent compute-efficient architectures. It catalogs 17 influential datasets, benchmarks popular methods on several representative datasets, and releases code for reproducibility. Key contributions include a chronological review of models, benchmarking, and a discussion of open problems and future opportunities. Among the challenges it highlights are modeling long-range temporal information, high computation costs, and results that are hard to compare because datasets and evaluation protocols vary.

Abstract

Video action recognition is one of the representative tasks for video understanding. Over the last decade, we have witnessed great advancements in video action recognition thanks to the emergence of deep learning. But we also encountered new challenges, including modeling long-range temporal information in videos, high computation costs, and incomparable results due to datasets and evaluation protocol variances. In this paper, we provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition. We first introduce the 17 video action recognition datasets that influenced the design of models. Then we present video action recognition models in chronological order: starting with early attempts at adapting deep learning, then to the two-stream networks, followed by the adoption of 3D convolutional kernels, and finally to the recent compute-efficient models. In addition, we benchmark popular methods on several representative datasets and release code for reproducibility. In the end, we discuss open problems and shed light on opportunities for video action recognition to facilitate new research ideas.

Paper Structure

This paper contains 49 sections, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Visual examples of categories in popular video action datasets.
  • Figure 2: Statistics of the most popular video action recognition datasets from the past 10 years. The area of a circle represents the scale of each dataset (i.e., number of videos).
  • Figure 3: A chronological overview of recent representative work in video action recognition.
  • Figure 4: Visual examples from popular video action datasets. Top: individual video frames from action classes in UCF101 and Kinetics400. A single frame from these scene-focused datasets often contains enough information to correctly guess the category. Middle: consecutive video frames from classes in Something-Something. The 2nd and 3rd frames are made transparent to indicate the importance of temporal reasoning: we cannot tell these two actions apart by looking at the 1st frame alone. Bottom: individual video frames from classes in Moments in Time. The same action can have different actors in different environments.
  • Figure 5: Visualizations of optical flow. We show four image-flow pairs: the left image is the original RGB frame and the right is the optical flow estimated by FlowNet2. The color of the optical flow indicates the direction of motion, and we follow the color coding scheme of FlowNet2, as shown in the top right.
  • ...and 1 more figure