Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention
Yiming Ma, Victor Sanchez, Soodeh Nikan, Devesh Upadhyay, Bhushan Atote, Tanaya Guha
TL;DR
The paper tackles robust driver monitoring for Level-2+ vehicles by improving how multiview multimodal data are fused. It introduces a feature-level fusion approach based on multi-head self-attention (MHSA) and a GPU-friendly supervised contrastive learning framework, SuMoCo, and adds fine-grained NDRA annotations to the test split of the DAD dataset. With all data sources, MHSA achieves state-of-the-art performance ($AUC\text{-}ROC = 97.0\%$, $mAP = 97.8\%$), and training with patch masking further improves robustness to view/modality collapses. The work advances practical DMS deployment by enabling robust multi-class NDRA recognition and by providing insights into modality importance and the effectiveness of different fusion strategies, while noting limitations such as data imbalance and overfitting.
Abstract
Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle's interior scene, and employ decision-level fusion to integrate these heterogeneous data. However, this fusion method may not fully utilize the complementarity of different data sources and may overlook their relative importance. To address these limitations, we propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA). We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF). We also present SuMoCo, a novel GPU-friendly supervised contrastive learning framework, to learn better representations. Furthermore, we refine the annotations of the test split of the DAD dataset to a finer granularity, enabling multi-class recognition of drivers' activities. Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses. The code and annotations are publicly available.
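To make the idea of masked feature-level fusion concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes each view/modality source has already been encoded into a single feature token, uses random projection matrices in place of learned weights, and masks whole sources rather than patches. The function name `masked_mhsa_fusion`, its parameters, and the mean-pooling step are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_mhsa_fusion(tokens, keep, num_heads=2, seed=0):
    """Fuse per-source feature tokens with masked multi-head self-attention.

    tokens: (S, D) array, one feature vector per view/modality source.
    keep:   (S,) boolean array, False marks a collapsed (masked) source.
    Returns the fused (D,) feature and the attention weights (H, S, S).
    """
    S, D = tokens.shape
    assert D % num_heads == 0, "feature size must be divisible by num_heads"
    assert keep.any(), "at least one source must be available"
    d_h = D // num_heads

    # Assumption: random matrices stand in for learned Q/K/V/output weights.
    w_rng = np.random.default_rng(seed)
    Wq, Wk, Wv, Wo = (w_rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

    def split_heads(x):  # (S, D) -> (H, S, d_h)
        return x.reshape(S, num_heads, d_h).transpose(1, 0, 2)

    q, k, v = (split_heads(tokens @ W) for W in (Wq, Wk, Wv))

    # Scaled dot-product scores; masked sources are hidden as keys.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)  # (H, S, S)
    scores = np.where(keep[None, None, :], scores, -np.inf)
    attn = softmax(scores, axis=-1)

    # Merge heads, project, then average over the available sources only.
    out = (attn @ v).transpose(1, 0, 2).reshape(S, D) @ Wo  # (S, D)
    fused = out[keep].mean(axis=0)
    return fused, attn


# Four hypothetical sources (e.g. two views x two modalities), third one collapsed.
rng = np.random.default_rng(42)
tokens = rng.standard_normal((4, 8))
keep = np.array([True, True, False, True])
fused, attn = masked_mhsa_fusion(tokens, keep)

# Corrupting the masked source leaves the fused feature unchanged.
corrupted = tokens.copy()
corrupted[2] = 1e3
fused_corrupted, _ = masked_mhsa_fusion(corrupted, keep)
```

Because masked sources receive zero attention as keys and are excluded from the final pooling, the fused feature does not depend on their content. That invariance is the property a model trained with masking would need in order to tolerate a collapsed view or modality at inference time.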
