MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition

Ruoyu Wang, Wenqian Wang, Jianjun Gao, Dan Lin, Kim-Hui Yap, Bingbing Li

TL;DR

MultiFuser tackles driver action recognition (DAR) under challenging car-cabin conditions by fusing RGB, IR, and depth streams in a Transformer-based architecture. It introduces Bi-decomposed Modules with an inter-modality stream (Modal Expertise ViT) and an intra-modality stream (Patch-wise Adaptive Fusion), plus a modality synthesizer that aggregates cross-modal features into a unified representation. The approach uses dynamic positional embeddings and patch-wise cross-modal fusion to capture fine-grained cross-modal interactions, and reports state-of-the-art results on the Drive&Act dataset with a Mean-1 accuracy of 70.62% and a Top-1 accuracy of 82.39%. Experiments show the benefits of full multimodal input and of the parallel fusion design, supporting the method's robustness to lighting variations and occlusions. The work advances practical multimodal DAR by enabling more reliable driver behavior understanding in real-world driving scenarios.

Abstract

Driver action recognition, which aims to accurately identify drivers' behaviors, is crucial for enhancing driver-vehicle interaction and ensuring driving safety. Unlike general action recognition, driver action recognition must often cope with challenging environments, such as dim or dark cabins. With the development of sensors, various cameras such as IR and depth cameras have emerged for analyzing drivers' behaviors. In this paper, we therefore propose a novel multimodal fusion transformer, named MultiFuser, which identifies cross-modal interrelations and interactions among multimodal car-cabin videos and adaptively integrates different modalities for improved representations. Specifically, MultiFuser comprises layers of Bi-decomposed Modules to model spatiotemporal features, with a modality synthesizer for multimodal feature integration. Each Bi-decomposed Module includes a Modal Expertise ViT block for extracting modality-specific features and a Patch-wise Adaptive Fusion block for efficient cross-modal fusion. Extensive experiments on the Drive&Act dataset demonstrate the efficacy of our proposed approach.
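
The idea of fusing modalities patch by patch can be illustrated with a minimal sketch. This is not the paper's Patch-wise Adaptive Fusion block, whose internals are not given here. It assumes a simple stand-in: each modality's token for a patch is scored against a hypothetical `score_vector`, the scores are turned into softmax weights, and the patch's fused token is the weighted sum. The names `patchwise_fuse` and `score_vector` and the toy tokens are illustrative assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patchwise_fuse(tokens_by_modality, score_vector):
    """Fuse tokens across modalities, one patch at a time.

    tokens_by_modality: dict mapping a modality name to a list of patch
        tokens (each token is a list of floats). All modalities are assumed
        to share the same number of patches and the same token width.
    score_vector: hypothetical scoring vector standing in for learned
        parameters (an assumption of this sketch).
    Returns (fused_tokens, weights); weights[p] has one entry per modality.
    """
    names = list(tokens_by_modality)
    num_patches = len(tokens_by_modality[names[0]])
    fused, weights = [], []
    for p in range(num_patches):
        patch_tokens = [tokens_by_modality[name][p] for name in names]
        # Score each modality's token for this patch with a dot product.
        scores = [
            sum(t * s for t, s in zip(token, score_vector))
            for token in patch_tokens
        ]
        w = softmax(scores)
        width = len(patch_tokens[0])
        # Weighted sum across modalities gives the fused token for the patch.
        fused.append([
            sum(w[i] * patch_tokens[i][d] for i in range(len(names)))
            for d in range(width)
        ])
        weights.append(w)
    return fused, weights


# Toy input: three modalities, two patches, token width two.
tokens = {
    "rgb":   [[1.0, 0.0], [0.0, 0.0]],
    "ir":    [[0.0, 1.0], [2.0, 0.0]],
    "depth": [[0.0, 0.0], [0.0, 2.0]],
}
fused, weights = patchwise_fuse(tokens, score_vector=[1.0, 1.0])
```

In this toy input the weights differ between the two patches, so a modality that is uninformative for one patch (here, an all-zero token) contributes less there while still contributing elsewhere. That per-patch adaptivity is the property the sketch is meant to show.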

Paper Structure (17 sections, 11 equations, 3 figures, 3 tables)


Figures (3)

  • Figure 1: For multimodal video sequences (such as RGB and IR), Patch-wise Adaptive Fusion (PAF) integrates information from different modalities on a per-patch basis to form a multimodal feature representation for each patch. Multimodal Integration then aggregates these features into a comprehensive understanding of the entire video, which is used for driver action recognition.
  • Figure 2: Overview of MultiFuser, a network that builds a comprehensive multimodal representation for driver action recognition. Given multimodal patch tokens as input, it applies $N$ layers of Bi-decomposed Modules to model spatiotemporal features, with a modality synthesizer for multimodal feature integration. Each Bi-decomposed Module comprises a Modal Expertise ViT block in the inter-modality decomposition stream, which extracts modality-specific features, and a Patch-wise Adaptive Fusion block in the intra-modality decomposition stream, which performs efficient cross-modal fusion. Finally, the modality synthesizer integrates these cross-modal features and combines them with the [CLS] token from each modality, providing a holistic multimodal representation of driver actions.
  • Figure 3: Different connection structures in the Bi-decomposed Module. (a) MultiFuser cascade first extracts the unimodal features (in Inter-modality Decomposition) and then fuses them (in Intra-modality Decomposition). The fused features are then fed into the next Bi-decomposed Module for further unimodal feature extraction and fusion. (b) MultiFuser parallel extracts the unimodal and cross-modal features simultaneously, which keeps the modality-specific features untouched.
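
The difference between the two connection structures in Figure 3 can be sketched with toy stand-ins. The code below is an illustration under stated assumptions, not the paper's implementation: `extract` replaces a Modal Expertise ViT block with a hypothetical per-modality scaling, `fuse` replaces Patch-wise Adaptive Fusion with an element-wise mean, and the cascade variant is assumed to hand the fused features to every modality's next layer. `GAINS`, `run_cascade`, and `run_parallel` are illustrative names.

```python
def extract(features, gain):
    """Stand-in for a Modal Expertise ViT block: hypothetical per-modality scaling."""
    return [gain * v for v in features]


def fuse(features_by_modality):
    """Stand-in for Patch-wise Adaptive Fusion: element-wise mean across modalities."""
    streams = list(features_by_modality.values())
    return [sum(vals) / len(vals) for vals in zip(*streams)]


# Hypothetical per-modality parameters, used only for this illustration.
GAINS = {"rgb": 1.0, "ir": 2.0, "depth": 3.0}


def run_cascade(inputs, num_layers):
    """Extract per modality, fuse, then feed the fused features to the next layer."""
    current = dict(inputs)
    unimodal, fused = dict(inputs), None
    for _ in range(num_layers):
        unimodal = {m: extract(x, GAINS[m]) for m, x in current.items()}
        fused = fuse(unimodal)
        # Assumption: every modality's next layer receives the fused features.
        current = {m: list(fused) for m in unimodal}
    return unimodal, fused


def run_parallel(inputs, num_layers):
    """Run the unimodal stream on its own; fusion reads from it without writing back."""
    unimodal, fused = dict(inputs), None
    for _ in range(num_layers):
        unimodal = {m: extract(x, GAINS[m]) for m, x in unimodal.items()}
        fused = fuse(unimodal)
    return unimodal, fused


inputs = {"rgb": [1.0, 1.0], "ir": [1.0, 1.0], "depth": [1.0, 1.0]}
cascade_unimodal, cascade_fused = run_cascade(inputs, num_layers=2)
parallel_unimodal, parallel_fused = run_parallel(inputs, num_layers=2)
```

In the parallel variant, the final unimodal features are exactly what repeated extraction alone would produce, because fusion never overwrites them. In the cascade variant, each layer after the first operates on already-fused features, so the unimodal stream no longer carries purely modality-specific information.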