From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Muhammad Bilal Shaikh, Syed Mohammed Shamsul Islam, Douglas Chai, Naveed Akhtar

TL;DR

This survey addresses the challenge of recognizing human actions from multimodal data (MHAR) by examining the shift from CNN-based to Transformer-based architectures and, crucially, by analyzing fusion strategies across modalities. It provides a taxonomy of CNN and Transformer MHAR methods, highlighting how early, middle, and late fusion are implemented within each paradigm and detailing cross-attention mechanisms that enable effective multimodal integration. The authors consolidate benchmarks, datasets, and architectural choices, offering design guidance for efficient, scalable MHAR models and identifying gaps such as data scarcity and the need for self-supervised and on-device approaches. The work aims to push MHAR research forward by clarifying architectural options, comparing CNNs and Transformers, and outlining practical pathways toward robust, real-world multimodal action recognition.

Abstract

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the last decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of "fusing" the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaptation of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an overview of the multimodal datasets from the viewpoints of scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.

Paper Structure

This paper contains 27 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: (Left) Number of relevant publications in recent years, identified with the data collected from the Web of Science. (Center) Categories of publication contributions to different sub-fields of Science - generated with data from the Web of Science. (Right) Distribution as document type (data collected from Scopus).
  • Figure 2: A typical multimodal fusion-based action recognition pipeline.
  • Figure 3: The Transformer, as originally proposed in vaswani2017attention and visualized in selvasurvey2022videotrans.
  • Figure 4: An example of multi-level fusion in deep-learning-based action recognition, where the red dotted lines represent four integration points corresponding to the different multimodal fusion methods examined. For a specific integration point, the network is duplicated for each of the $K$ modalities, the features are concatenated at the integration point, and the network after the integration point remains unchanged. (adapted from LongGMLLLW18)
  • Figure 5: Four main types of performing multimodal fusion in Transformers (adapted from selvasurvey2022videotrans).
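
The fusion scheme described in the Figure 4 caption can be illustrated with a small sketch. This is not the implementation from LongGMLLLW18 or from the survey; it is a toy, dependency-free Python example that assumes bias-free linear layers with ReLU activations, random weights, and two hypothetical modalities. The names `fuse_at`, `linear`, `relu`, and `make_weights` are invented for illustration. The sketch also assumes that the first layer after the integration point accepts the concatenated feature vector, whose width grows with $K$.

```python
import random


def linear(x, weights):
    """Apply a bias-free linear layer: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in x]


def make_weights(rng, n_out, n_in):
    """Random weight matrix with n_out rows and n_in columns (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def fuse_at(inputs, branch_layers, shared_layers):
    """Fuse K modalities by concatenation at one integration point.

    inputs:        list of K input vectors, one per modality
    branch_layers: list of K lists of weight matrices, applied per modality
                   BEFORE the integration point (an empty list means the raw
                   inputs are concatenated directly)
    shared_layers: list of weight matrices applied AFTER the integration point
    """
    features = []
    for x, layers in zip(inputs, branch_layers):
        for w in layers:
            x = relu(linear(x, w))
        features.append(x)

    # Integration point: concatenate the K per-modality feature vectors.
    fused = [v for feature in features for v in feature]

    # The remainder of the network is shared across modalities.
    for w in shared_layers:
        fused = relu(linear(fused, w))
    return fused


# Toy setup: K = 2 hypothetical modalities, each with a 4-dimensional input.
rng = random.Random(0)
inputs = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.5, 0.5, 0.5]]

# Integration after one per-modality layer (4 -> 3 features each),
# followed by one shared layer (2 * 3 = 6 -> 2 outputs).
branches = [[make_weights(rng, 3, 4)] for _ in inputs]
shared = [make_weights(rng, 2, 6)]
mid_out = fuse_at(inputs, branches, shared)

# Integration at the input: no per-modality layers, so the shared layer
# sees the raw concatenated inputs (2 * 4 = 8 -> 2 outputs).
early_out = fuse_at(inputs, [[], []], [make_weights(rng, 2, 8)])

print(len(mid_out), len(early_out))
```

Moving the integration point only changes how many layers sit in `branch_layers` versus `shared_layers`, which is one way to read the four dotted lines in the figure: the earlier the integration point, the more of the network is shared across modalities.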