ViT-DD: Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection

Yunsheng Ma; Ziran Wang

TL;DR

This work presents ViT-DD, a pure Vision Transformer framework for driver distraction detection that also exploits driver emotion signals through a semi-supervised, pseudo-labeled multi-task training scheme. ViT-DD combines two inputs (the driver image and the cropped face image) and two tasks (distraction detection and emotion recognition), and achieves state-of-the-art performance on SFDDD and AUCDD, including under the challenging split-by-driver setting. A facial expression recognition (FER) teacher generates pseudo emotion labels for face images that lack emotion annotations, and a single ViT-DD model is then trained with multi-task objectives. The resulting model generalizes better and offers interpretable attention visualizations. The results suggest that incorporating emotion information via multi-task learning can substantially enhance real-time driver monitoring for ADAS/ADS, with potential extensions to gaze and head pose cues.

Abstract

Ensuring traffic safety and mitigating accidents in modern driving is of paramount importance, and computer vision technologies have the potential to significantly contribute to this goal. This paper presents a multi-modal Vision Transformer for Driver Distraction Detection (termed ViT-DD), which incorporates inductive information from training signals related to both distraction detection and driver emotion recognition. Additionally, a self-learning algorithm is developed, allowing for the seamless integration of driver data without emotion labels into the multi-task training process of ViT-DD. Experimental results reveal that the proposed ViT-DD surpasses existing state-of-the-art methods for driver distraction detection by 6.5% and 0.9% on the SFDDD and AUCDD datasets, respectively.
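The pseudo-labeled multi-task idea described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. It assumes hard pseudo labels (the argmax of the teacher's logits) and a weighted sum of two cross-entropy terms. The function names and the `emo_weight` parameter are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Cross-entropy of one sample given raw logits and an integer label."""
    return -math.log(softmax(logits)[label])


def pseudo_label(teacher_logits):
    """Assumption: the FER teacher's argmax is used as a hard pseudo emotion label."""
    return max(range(len(teacher_logits)), key=teacher_logits.__getitem__)


def multitask_loss(dist_logits, dist_label, emo_logits, emo_pseudo, emo_weight=1.0):
    """Assumption: distraction loss plus a weighted emotion loss on the pseudo label."""
    loss_dist = cross_entropy(dist_logits, dist_label)
    loss_emo = cross_entropy(emo_logits, emo_pseudo)
    return loss_dist + emo_weight * loss_emo


# Toy example: the teacher labels a face image, then the student is scored on
# the ground-truth distraction label and the pseudo emotion label.
teacher_logits = [0.1, 2.0, -1.0]   # hypothetical teacher output over 3 emotions
emo_pseudo = pseudo_label(teacher_logits)
loss = multitask_loss(
    dist_logits=[2.0, 0.0], dist_label=0,
    emo_logits=[0.0, 0.0, 0.0], emo_pseudo=emo_pseudo,
)
```

In this sketch the emotion term acts as an auxiliary training signal; setting `emo_weight` to 0 reduces it to single-task distraction training.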

Paper Structure (16 sections, 3 equations, 3 figures, 2 tables)

Introduction
Background
Vision Transformer
Multi-Task Learning and Self-Training
Facial Expression Recognition and Driver Distraction Detection
Methodology
Model Overview
Pseudo-Labeled Multi-Task Training
Experiments and Results
Benchmarks
Baselines
Implementation Details
Comparison with State-of-the-Art
Ablation Study
Visualization
...and 1 more sections

Figures (3)

Figure 1: (Left) The framework of the proposed ViT-DD: First, a face detector is applied to the input signal from an in-cabin camera to extract the driver's facial area. Then, the driver and face images are divided into patches and independently embedded into visual tokens. Next, the driver and face embeddings are summed with their respective position embeddings, and the two resulting sequences are concatenated. In addition, class tokens representing distraction and emotion are prepended. The sequence of class and visual tokens is then iteratively updated through $L$ Transformer layers. The class tokens from the final sequence are used to recognize the driver's distraction and emotion states through their corresponding MLP heads. (Right) Pseudo-labeled Multi-Task Learning: A well-trained Facial Expression Recognition Teacher ViT labels the drivers' face images, which have no emotion annotations, to create a multi-task driver dataset. This dataset, containing both ground-truth distraction labels and pseudo emotion labels, is then used to train a student ViT-DD model with multi-task learning.
Figure 2: Attention maps generated by a well-trained ViT-DD model during inference on the AUCDD dataset. The maps depict how the distraction token attends to the visual tokens across the $L=12$ Transformer layers of ViT-DD. Colors correspond to the level of attention, with red indicating high attention and blue indicating low attention. As the network grows deeper, the distraction token focuses on precise local cues rather than the entire image. The model concentrates on critical areas of the input images, such as the steering wheel region in the first scenario and the phone region in the second scenario.
Figure 3: Confusion matrices of the standard ViT and ViT-DD on AUCDD.
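
The token sequence described in Figure 1 and the attention maps in Figure 2 can be made concrete with a small bookkeeping sketch. It only tracks token indices and does not run a Transformer. The image sizes (224x224 driver, 32x32 face), the patch size of 16, the order of the two class tokens, and all function names are assumptions made for illustration and are not taken from the paper.

```python
def num_patches(height, width, patch):
    """Number of non-overlapping square patches in an image."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    return (height // patch) * (width // patch)


def token_layout(driver_hw, face_hw, patch):
    """Index ranges for the assumed sequence
    [distraction token, emotion token, driver patches, face patches]."""
    n_driver = num_patches(*driver_hw, patch)
    n_face = num_patches(*face_hw, patch)
    return {
        "distraction": range(0, 1),
        "emotion": range(1, 2),
        "driver": range(2, 2 + n_driver),
        "face": range(2 + n_driver, 2 + n_driver + n_face),
    }


def driver_attention_grid(attn_row, layout, grid_w):
    """Reshape the distraction token's attention over driver patches
    into a list of rows, ready to be drawn as a heat map."""
    vals = [attn_row[i] for i in layout["driver"]]
    return [vals[r:r + grid_w] for r in range(0, len(vals), grid_w)]


# Hypothetical sizes: 224x224 driver image, 32x32 face crop, 16x16 patches.
layout = token_layout((224, 224), (32, 32), 16)
seq_len = layout["face"].stop

# Placeholder attention row for the distraction token (uniform weights).
attn_row = [1.0 / seq_len] * seq_len
grid = driver_attention_grid(attn_row, layout, grid_w=224 // 16)
```

With a real model, `attn_row` would be the distraction token's row of one layer's attention matrix (for example averaged over heads); here it is a uniform placeholder so the sketch stays self-contained.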
