ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection
Yunsheng Ma, Ziran Wang
TL;DR
This work presents ViT-DD, a pure Vision Transformer-based framework for driver distraction detection that exploits driver emotion signals through a semi-supervised, pseudo-labeled multi-task training scheme. By combining two inputs (driver and face images) and two tasks (distraction detection and emotion recognition), ViT-DD achieves state-of-the-art performance on SFDDD and AUCDD, particularly under the challenging split-by-driver setting. The method uses a facial expression recognition (FER) teacher to generate pseudo emotion labels for face data that lack emotion annotations, then trains a single ViT-DD model with multi-task objectives. This improves generalization and yields interpretable attention visualizations. The results suggest that incorporating emotion information via multi-task learning can substantially enhance real-time driver monitoring in ADAS/ADS, with potential extensions to gaze and head pose cues.
Abstract
Ensuring traffic safety and mitigating accidents in modern driving are of paramount importance, and computer vision technologies have the potential to contribute significantly to this goal. This paper presents a multi-modal Vision Transformer for Driver Distraction Detection (termed ViT-DD), which incorporates inductive information from training signals related to both distraction detection and driver emotion recognition. Additionally, a self-learning algorithm is developed that allows driver data without emotion labels to be integrated into the multi-task training process of ViT-DD. Experimental results show that the proposed ViT-DD surpasses existing state-of-the-art methods for driver distraction detection by 6.5% and 0.9% on the SFDDD and AUCDD datasets, respectively.
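
The pseudo-labeled multi-task scheme described above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the argmax labeling rule, and the `emotion_weight` factor that balances the two cross-entropy terms are all assumptions made for illustration, since the text here does not specify how the two objectives are combined.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def pseudo_label(teacher_logits):
    """Turn a teacher's emotion logits for one face image into a pseudo label.

    Returns (class_index, confidence). Taking the argmax is an assumed rule.
    """
    probs = softmax(teacher_logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return idx, probs[idx]


def cross_entropy(logits, target):
    """Cross-entropy of a single prediction against an integer class label."""
    return -math.log(softmax(logits)[target])


def multi_task_loss(distraction_logits, distraction_label,
                    emotion_logits, emotion_pseudo_label,
                    emotion_weight=1.0):
    """Sum of the distraction loss and a weighted emotion loss.

    The emotion target comes from the teacher, not from human annotation.
    emotion_weight is a hypothetical hyperparameter.
    """
    loss_distraction = cross_entropy(distraction_logits, distraction_label)
    loss_emotion = cross_entropy(emotion_logits, emotion_pseudo_label)
    return loss_distraction + emotion_weight * loss_emotion


if __name__ == "__main__":
    # Hypothetical teacher output for one face crop (3 emotion classes).
    label, confidence = pseudo_label([0.1, 2.0, -1.0])
    # Hypothetical student outputs for the two task heads.
    loss = multi_task_loss([1.5, 0.2], 0, [0.3, 1.1, -0.4], label)
    print(label, round(confidence, 3), round(loss, 3))
```

In this sketch the student model receives a ground-truth distraction label and a teacher-generated emotion label for the same sample, so data without emotion annotations can still contribute to both objectives.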
