RNNs, CNNs and Transformers in Human Action Recognition: A Survey and a Hybrid Model

Khaled Alomar, Halil Ibrahim Aysel, Xiaohao Cai

TL;DR

The paper surveys CNN-, RNN-, and Vision Transformer (ViT)-based approaches to Human Action Recognition (HAR), highlighting the strengths and limitations of each paradigm in capturing spatial and temporal information. It introduces a novel CNN-ViT hybrid model in which a TimeDistributed CNN backbone extracts per-frame features and a ViT then models temporal dependencies across frames and classifies the action. In experiments on the KTH dataset, the hybrid model achieves top performance (up to 97.89% on 24-frame sequences), demonstrating the benefit of fusing local spatial features with global temporal attention. The work underscores the growing importance of hybrid architectures in HAR as a path toward more robust, data-efficient, and real-time action recognition in domains such as surveillance, healthcare, and entertainment.
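
The data flow described above (one shared CNN applied to every frame, then attention across the resulting frame tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the weights are random and untrained, the "CNN" is a single convolution layer with global average pooling, the "ViT" is reduced to one single-head self-attention step with no LayerNorm, MLP block, or residual connections, and the function names, the 32x32 frame size, the feature widths, and the six output classes are all assumptions chosen for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def per_frame_features(video, filters):
    """Apply the same small conv layer to every frame (the TimeDistributed idea).

    video:   (T, H, W) grayscale clip
    filters: (C, k, k) convolution filters shared by all frames
    returns: (T, C), one feature vector per frame
    """
    k = filters.shape[-1]
    # Every k x k patch of every frame: (T, H-k+1, W-k+1, k, k)
    patches = sliding_window_view(video, (k, k), axis=(1, 2))
    # Valid cross-correlation with each filter, followed by ReLU
    maps = np.maximum(np.einsum("thwij,cij->thwc", patches, filters), 0.0)
    # Global average pooling over the two spatial axes
    return maps.mean(axis=(1, 2))


def temporal_self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over the token axis."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights


def hybrid_forward(video, params):
    """Per-frame CNN -> frame tokens -> temporal attention -> class probabilities."""
    feats = per_frame_features(video, params["filters"])     # (T, C)
    frame_tokens = feats @ params["embed"]                   # (T, d)
    tokens = np.concatenate([params["cls"], frame_tokens])   # (T + 1, d)
    tokens = tokens + params["pos"]                          # position embeddings
    attended, weights = temporal_self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"]
    )
    logits = attended[0] @ params["head"]                    # read the class token
    return softmax(logits), weights


def init_params(rng, num_frames, channels=8, dim=16, num_classes=6):
    """Random, untrained weights; every size here is an illustrative assumption."""
    def w(*shape):
        return rng.normal(size=shape) * 0.1

    return {
        "filters": w(channels, 3, 3),
        "embed": w(channels, dim),
        "cls": w(1, dim),
        "pos": w(num_frames + 1, dim),
        "w_q": w(dim, dim),
        "w_k": w(dim, dim),
        "w_v": w(dim, dim),
        "head": w(dim, num_classes),
    }


rng = np.random.default_rng(0)
clip = rng.random((24, 32, 32))  # 24 frames as in the TL;DR; 32x32 is assumed
probs, attn = hybrid_forward(clip, init_params(rng, num_frames=24))
print(probs.shape, attn.shape)   # (6,) (25, 25)
```

Each row of `attn` shows how strongly one token (the class token or a frame) attends to every other token, which is the "global temporal attention" part of the design; the convolution stage only ever sees one frame at a time, so all cross-frame reasoning happens in the attention step.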

Abstract

Human Action Recognition (HAR) encompasses the task of monitoring human activities across various domains, including but not limited to medical, educational, entertainment, visual surveillance, video retrieval, and the identification of anomalous activities. Over the past decade, the field of HAR has witnessed substantial progress by leveraging Convolutional Neural Networks (CNNs) to effectively extract and comprehend intricate information, thereby enhancing the overall performance of HAR systems. Recently, Vision Transformers (ViTs) have emerged in computer vision as a potent solution. The efficacy of transformer architectures has been validated beyond the confines of image analysis, extending their applicability to diverse video-related tasks. Within this landscape, the research community has shown keen interest in HAR, acknowledging its manifold utility and widespread adoption across various domains. This article aims to present an encompassing survey that focuses on CNNs and on the evolution from Recurrent Neural Networks (RNNs) to ViTs, given their importance in the domain of HAR. By conducting a thorough examination of existing literature and exploring emerging trends, this study undertakes a critical analysis and synthesis of the accumulated knowledge in this field. Additionally, it investigates the ongoing efforts to develop hybrid approaches. Following this direction, this article presents a novel hybrid model that seeks to integrate the inherent strengths of CNNs and ViTs.

Paper Structure

This paper contains 26 sections, 4 equations, 7 figures, and 3 tables.

Figures (7)

  • Figure 1: Various types of RNN cells.
  • Figure 2: Types of RNN structures based on input-output pairs.
  • Figure 3: Sequence-to-sequence RNN with and without the attention mechanism.
  • Figure 4: Transformer architecture and its self-attention mechanism (adapted from Vaswani et al., 2017).
  • Figure 5: The ViT architecture (adapted from Dosovitskiy et al., 2020).
  • ...and 2 more figures