MultiFuser: Multimodal Fusion Transformer for Enhanced Driver Action Recognition
Ruoyu Wang, Wenqian Wang, Jianjun Gao, Dan Lin, Kim-Hui Yap, Bingbing Li
TL;DR
MultiFuser tackles driver action recognition (DAR) under challenging car-cabin conditions by fusing RGB, IR, and depth streams in a Transformer-based architecture. It introduces Bi-decomposed Modules with an intra-modality stream (Modal Expertise ViT) and an inter-modality stream (Patch-wise Adaptive Fusion), plus a modality synthesizer that aggregates cross-modal features into a unified representation. The approach employs dynamic positional embeddings and patch-wise cross-modal fusion to capture fine-grained interactions between modalities, achieving state-of-the-art results on the Drive&Act dataset with a Mean-1 accuracy of 70.62% and a Top-1 accuracy of 82.39%. Experiments show the benefits of full multimodal input and of the parallel fusion design, and support the method's robustness to lighting variations and occlusions. The work advances practical multimodal DAR by enabling more reliable driver behavior understanding in real-world driving scenarios.
Abstract
Driver action recognition, which aims to accurately identify drivers' behaviors, is crucial for enhancing driver-vehicle interactions and ensuring driving safety. Unlike general action recognition, it must cope with challenging environments, since car cabins are often dim or dark. With the development of sensors, additional cameras such as IR and depth cameras have become available for analyzing drivers' behaviors. In this paper, we therefore propose a novel multimodal fusion transformer, named MultiFuser, which identifies cross-modal interrelations and interactions among multimodal car-cabin videos and adaptively integrates the different modalities for improved representations. Specifically, MultiFuser comprises layers of Bi-decomposed Modules that model spatiotemporal features, together with a modality synthesizer for multimodal feature integration. Each Bi-decomposed Module includes a Modal Expertise ViT block for extracting modality-specific features and a Patch-wise Adaptive Fusion block for efficient cross-modal fusion. Extensive experiments on the Drive&Act dataset demonstrate the efficacy of the proposed approach.
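
To make the idea of patch-wise adaptive fusion concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the abstract does not specify the fusion operator, so the sketch assumes a simple softmax gate that weights each modality separately at every patch position. The function name `patchwise_fusion` and the `gate_weights` scoring vectors are hypothetical, introduced only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def patchwise_fusion(features, gate_weights):
    """Fuse per-modality patch features with a per-patch softmax gate.

    features:     dict mapping modality name -> list of patches,
                  each patch a list of D floats (same shape per modality).
    gate_weights: dict mapping modality name -> list of D floats.
                  ASSUMPTION: a hypothetical scoring vector standing in
                  for whatever learned gating the real model uses.
    Returns (fused_patches, per_patch_weights).
    """
    modalities = list(features)
    num_patches = len(features[modalities[0]])
    fused, weights = [], []
    for p in range(num_patches):
        # One scalar relevance score per modality for this patch.
        scores = [
            sum(w * x for w, x in zip(gate_weights[m], features[m][p]))
            for m in modalities
        ]
        alpha = softmax(scores)
        dim = len(features[modalities[0]][p])
        # Weighted sum of the modality features at this patch position.
        fused.append([
            sum(a * features[m][p][d] for a, m in zip(alpha, modalities))
            for d in range(dim)
        ])
        weights.append(dict(zip(modalities, alpha)))
    return fused, weights


# Toy input: two patches, two feature dimensions, three modalities.
features = {
    "rgb":   [[1.0, 0.0], [0.0, 0.0]],
    "ir":    [[0.0, 1.0], [0.0, 0.0]],
    "depth": [[0.0, 0.0], [2.0, 2.0]],
}
gate = {m: [1.0, 1.0] for m in features}
fused, weights = patchwise_fusion(features, gate)
```

Because the weights are computed independently at each patch, a modality that is uninformative in one image region (for example, RGB in a dark area) can be down-weighted there while still contributing elsewhere. In this toy input the second patch carries signal only in the depth stream, so the gate assigns most of its weight to depth at that position.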
