Robust Multiview Multimodal Driver Monitoring System Using Masked Multi-Head Self-Attention

Yiming Ma, Victor Sanchez, Soodeh Nikan, Devesh Upadhyay, Bhushan Atote, Tanaya Guha

TL;DR

The paper tackles robust driver monitoring in Level-2+ systems by addressing the fusion of multiview multimodal data. It introduces a feature-level fusion approach based on multi-head self-attention (MHSA) and a GPU-friendly supervised contrastive learning framework, SuMoCo, and it adds fine-grained annotations of non-driving-related activities (NDRAs) to the DAD dataset. Results show that MHSA achieves state-of-the-art performance with $AUC\text{-}ROC = 97.0\%$ and $mAP = 97.8\%$ when using all data sources, and that patch masking further improves robustness to view/modality collapses. The work advances practical DMS deployment by enabling robust, multi-class NDRA recognition and by providing insights into modality importance and the effectiveness of different fusion strategies, while noting limitations such as data imbalance and overfitting.

Abstract

Driver Monitoring Systems (DMSs) are crucial for safe hand-over actions in Level-2+ self-driving vehicles. State-of-the-art DMSs leverage multiple sensors mounted at different locations to monitor the driver and the vehicle's interior scene, and employ decision-level fusion to integrate these heterogeneous data. However, this fusion method may not fully utilize the complementarity of different data sources and may overlook their relative importance. To address these limitations, we propose a novel multiview multimodal driver monitoring system based on feature-level fusion through multi-head self-attention (MHSA). We demonstrate its effectiveness by comparing it against four alternative fusion strategies (Sum, Conv, SE, and AFF). We also present a novel GPU-friendly supervised contrastive learning framework, SuMoCo, to learn better representations. Furthermore, we annotate the test split of the DAD dataset with fine-grained labels to enable multi-class recognition of drivers' activities. Experiments on this enhanced database demonstrate that 1) the proposed MHSA-based fusion method (AUC-ROC: 97.0%) outperforms all baselines and previous approaches, and 2) training MHSA with patch masking can improve its robustness against modality/view collapses. The code and annotations are publicly available.

Paper Structure (15 sections, 13 equations, 7 figures, 4 tables)

This paper contains 15 sections, 13 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: An overview of our proposed DMS. R3D-18 [tran2018closer] backbones are utilized to extract spatial-temporal features from the multiview multimodal inputs. These feature maps are subsequently fused via multi-head self-attention (illustrated in Figure 2). A 2-layer perceptron is leveraged to project the fused features into the contrastive embedding space, while another 2-layer perceptron generates the score for each class. The orange blocks constitute the query encoder of our proposed contrastive learning framework, SuMoCo. They are supervised using the InfoNCE loss [oord2018representation], and the classifier is trained separately using the focal loss [lin2017focal].
  • Figure 2: The structure of our proposed multi-head self-attention feature fusion module, MHSA. We first split the extracted feature maps into fixed-size patches and add the source embedding and the positional embedding to them. Next, we randomly mask some patches and feed the remaining ones into the transformer encoder block [vaswani2017attention, dosovitskiy2021an] to learn interactions among them and capture the global context. This masking operation simulates view/modality collapse, leading to improved robustness. Finally, the attended patches are summed to generate the output.
  • Figure 3: The structure of our proposed squeeze-and-excitation feature fusion module, SE. Feature maps are first concatenated along the channel axis. We leverage the squeeze-and-excitation mechanism [hu2018squeeze] to learn the weight for each channel. The weight matrices are then used to average the input feature maps. In this way, our method can model the interaction between different views and modalities and learn their relative importance.
  • Figure 4: The distribution of the fine-grained classes. The label "normal" refers to normal driving, and the other nine are non-driving-related activities (NDRAs): "radio": tuning the radio; "backseat": reaching the back seat; "talking": talking with the passenger; "phone": talking on the phone. The "L/R" suffix stands for "with the left/right hand", and NDRAs that appear only in the test set are labeled as "unseen".
  • Figure 5: Masked training improves MHSA's robustness against corrupt views/modalities. MHSA is trained with all four data sources in DAD and with mask ratios ranging from $0.0$ (i.e., no masking) to $0.9$ (i.e., $90\%$ of the patches are masked). In testing, it is evaluated with zero to three data sources collapsed. Thus, a smaller performance degradation indicates stronger robustness against corrupt data sources. The $x$-axis in the resulting plots indicates the mask ratio, and the $y$-axis displays the corresponding average score.
  • ...and 2 more figures
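
The masking, summation, and source-collapse steps described in the captions of Figures 2 and 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mask_patches`, `sum_fuse`, `collapse_sources`) and the source names in the demo are hypothetical, patches are plain lists of floats, the rounding rule for the number of kept patches is an assumption, and the transformer encoder that sits between masking and summation is omitted entirely.

```python
import random


def mask_patches(patches, mask_ratio, rng):
    """Randomly drop roughly `mask_ratio` of the patches (training-time masking).

    At least one patch is always kept so the fused output is defined.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    n_keep = max(1, round(len(patches) * (1.0 - mask_ratio)))
    # Sample indices without replacement, then restore the original order.
    keep_idx = sorted(rng.sample(range(len(patches)), n_keep))
    return [patches[i] for i in keep_idx]


def sum_fuse(patches):
    """Element-wise sum of equally sized patch vectors.

    In the described module this sum is taken over the *attended* patches;
    the attention step is left out of this sketch.
    """
    dim = len(patches[0])
    return [sum(p[d] for p in patches) for d in range(dim)]


def collapse_sources(patches_by_source, collapsed):
    """Simulate test-time view/modality collapse by dropping whole sources."""
    return [
        patch
        for name, source_patches in patches_by_source.items()
        if name not in collapsed
        for patch in source_patches
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    # Four hypothetical data sources with two 3-dimensional patches each.
    sources = {
        f"source_{k}": [[rng.random() for _ in range(3)] for _ in range(2)]
        for k in range(4)
    }
    all_patches = collapse_sources(sources, collapsed=set())
    kept = mask_patches(all_patches, mask_ratio=0.5, rng=rng)
    print("kept patches:", len(kept), "fused:", sum_fuse(kept))
    # Evaluate with one source collapsed, as in the Figure 5 protocol.
    survivors = collapse_sources(sources, collapsed={"source_0"})
    print("patches after collapse:", len(survivors))
```

The two functions remove inputs at different granularities: `mask_patches` drops individual patches at random during training, whereas `collapse_sources` drops every patch from a given source at test time. The Figure 5 caption reports that the former improves robustness to the latter.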