
Cross-Camera Distracted Driver Classification through Feature Disentanglement and Contrastive Learning

Luigi Celona, Simone Bianco, Paolo Napoletano

Abstract

The classification of distracted drivers is pivotal for ensuring safe driving. Previous studies have demonstrated the effectiveness of neural networks in automatically predicting driver distraction, fatigue, and potential hazards. However, recent research has uncovered a significant loss of accuracy when these models are applied to samples acquired under conditions that differ from the training data. In this paper, we introduce a robust model designed to withstand changes in camera position within the vehicle. Our Driver Behavior Monitoring Network (DBMNet) relies on a lightweight backbone and integrates a disentanglement module to discard camera view information from features, coupled with contrastive learning to enhance the encoding of various driver actions. Experiments conducted using a leave-one-camera-out protocol on the daytime and nighttime subsets of the 100-Driver dataset validate the effectiveness of our approach. Cross-dataset and cross-camera experiments conducted on three benchmark datasets, namely AUCDD-V1, EZZ2021 and SFD, demonstrate the superior generalization capabilities of the proposed method. Overall, DBMNet achieves an improvement of 7% in Top-1 accuracy compared to existing efficient approaches. Moreover, quantized versions of DBMNet and of all considered methods have been deployed on a Coral Dev Board. In this deployment scenario, DBMNet outperforms the alternatives, achieving the lowest average error while maintaining a compact model size, low memory footprint, fast inference time, and minimal power consumption.
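The leave-one-camera-out protocol mentioned in the abstract can be illustrated with a minimal sketch. The camera names D1 to D4 come from the 100-Driver daytime subset; the helper name `leave_one_camera_out` is hypothetical and is not taken from the authors' code.

```python
from typing import List, Tuple


def leave_one_camera_out(cameras: List[str]) -> List[Tuple[List[str], str]]:
    """Return one (training cameras, held-out test camera) pair per camera."""
    splits = []
    for held_out in cameras:
        # Train on every camera except the one reserved for testing.
        train = [c for c in cameras if c != held_out]
        splits.append((train, held_out))
    return splits


# Daytime cameras of the 100-Driver dataset.
daytime_splits = leave_one_camera_out(["D1", "D2", "D3", "D4"])
for train_cams, test_cam in daytime_splits:
    print(f"train on {train_cams}, test on {test_cam}")
```

Each split trains on three camera views and evaluates on the unseen fourth, so the reported accuracy reflects robustness to a camera position never observed during training.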

Paper Structure

This paper contains 31 sections, 7 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: The DBMNet pipeline. At inference time (a), the backbone encodes an RGB image into a feature vector $\mathbf{f}$. The feature disentanglement module then refines these features, discarding view-related information while retaining action-related details, using the view queries $\mathbf{Q}$ together with the probability distribution over the views. The action classifier maps the resulting features $\mathbf{\hat{f}}$ to a driver action. During training (b), alongside the cross-entropy losses for action and view prediction ($\mathcal{L}_{\rm{ace}}$ and $\mathcal{L}_{\rm{vce}}$, respectively), we employ triplet losses ($\mathcal{L}_{\rm{ac}}$ and $\mathcal{L}_{\rm{vc}}$). These triplet losses help to learn distinct action and view representations by processing three input images: an anchor image $a$, an image $sv$ showing the same view as $a$ but a different action, and an image $sa$ showing the same action as $a$ but from a different view.
  • Figure 2: Sample images showcasing a driver engaged in both safe and distracted driving behavior, taken from the 100-Driver dataset wang2023100. The images are captured with four cameras installed in the same vehicle in both daytime (D1, D2, D3 and D4) and nighttime (N1, N2, N3 and N4) contexts.
  • Figure 3: Accuracy of the model trained on D1, D2, and D3 and tested on D4, as a function of the hyperparameters $\lambda_{\rm{ac}}$ and $\lambda_{\rm{vc}}$ defined in Eq. \ref{eq:loss}. The red cross marks the configuration that yields the highest accuracy.
  • Figure 4: Confusion matrix for each cross-camera configuration (best viewed zoomed in).
  • Figure 5: Sample images showing consistency in framing in the 100-Driver (using the D4 camera) wang2023100, AUCDD-V1 eraqi2019driver, EZZ2021 ezzouhri2021robust and SFD statefarm2020sfd datasets.
  • ...and 3 more figures
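
The triplet construction described in the Figure 1 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes a standard margin-based triplet loss with Euclidean distance, a margin of 1.0, and the natural role assignment in which $sa$ is the positive and $sv$ the negative for the action embedding, with the roles swapped for the view embedding. All embedding values are made up.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin: float = 1.0) -> float:
    """Margin-based triplet loss: pull the positive closer than the negative."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Hypothetical 2-D action embeddings (assumed values, for illustration only).
a_act = [0.0, 0.0]   # anchor image a
sa_act = [0.1, 0.0]  # same action, different view: positive for the action loss
sv_act = [2.0, 0.0]  # same view, different action: negative for the action loss
loss_ac = triplet_loss(a_act, sa_act, sv_act)

# Hypothetical 2-D view embeddings: the roles of sv and sa are swapped.
a_view = [0.0, 0.0]
sv_view = [0.5, 0.0]  # same view: positive for the view loss
sa_view = [1.0, 0.0]  # different view: negative for the view loss
loss_vc = triplet_loss(a_view, sv_view, sa_view)

print(loss_ac, loss_vc)
```

In the action example the negative already lies farther from the anchor than the positive by more than the margin, so the loss is zero; in the view example the gap is only 0.5, so a loss of 0.5 remains and would push the two views further apart.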