Vision-Language Models can Identify Distracted Driver Behavior from Naturalistic Videos

Md Zahid Hasan, Jiajing Chen, Jiyang Wang, Mohammed Shaiqur Rahman, Ameya Joshi, Senem Velipasalar, Chinmay Hegde, Anuj Sharma, Soumik Sarkar

TL;DR

This work tackles the challenge of identifying distracted driving from naturalistic videos under limited labeled data by leveraging vision-language models, specifically CLIP. It proposes two families of frameworks: frame-based (Zero-shotCLIP, Single-frameCLIP, Multi-frameCLIP) and a video-based model (VideoCLIP), all built on frozen CLIP visual encoders with a task-specific classifier on top and temporal aggregation. Through extensive experiments on four public datasets with driver-out cross-validation, the study shows that temporal models, especially VideoCLIP, achieve state-of-the-art Top-1 accuracy (e.g., up to 98.44% on DMD and 97.86% on SAM-DD) and robust performance with reduced training data, outperforming traditional CNN-based baselines. The results demonstrate the practical potential of secure, data-efficient, multimodal learning for real-world driver monitoring, while outlining limitations and directions for broader action sets, multi-distraction scenarios, and uncertainty-aware deployments.

Abstract

Recognizing the activities causing distraction in real-world driving scenarios is critical for ensuring the safety and reliability of both drivers and pedestrians on the roadways. Conventional computer vision techniques are typically data-intensive and require a large volume of annotated training data to detect and classify various distracted driving behaviors, thereby limiting their efficiency and scalability. We aim to develop a generalized framework that showcases robust performance with access to limited or no annotated training data. Recently, vision-language models have offered large-scale visual-textual pretraining that can be adapted to task-specific learning like distracted driving activity recognition. Vision-language pretraining models, such as CLIP, have shown significant promise in learning natural language-guided visual representations. This paper proposes a CLIP-based driver activity recognition approach that identifies driver distraction from naturalistic driving images and videos. CLIP's vision embedding offers zero-shot transfer and task-based finetuning, which can classify distracted activities from driving video data. Our results show that this framework offers state-of-the-art performance on zero-shot transfer and video-based CLIP for predicting the driver's state on two public datasets. We propose both frame-based and video-based frameworks developed on top of the CLIP's visual representation for distracted driving detection and classification tasks and report the results.
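The zero-shot transfer mentioned in the abstract can be illustrated with a minimal sketch. It assumes the image embedding and one text embedding per class prompt have already been produced by a CLIP-style encoder; the function name `zero_shot_classify`, the toy vectors, and the `logit_scale` value of 100 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, logit_scale=100.0):
    """Pick the class whose text embedding is most similar to the image embedding.

    image_emb: shape (d,), assumed to come from a frozen image encoder.
    text_embs: shape (num_classes, d), one row per class prompt.
    logit_scale: assumed temperature applied to the cosine similarities.
    """
    # L2-normalize so the dot product equals cosine similarity.
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)

    logits = logit_scale * (text_embs @ image_emb)
    logits = logits - logits.max()  # numerical stability for the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(np.argmax(probs)), probs

# Toy placeholder embeddings standing in for three hypothetical class prompts.
toy_text_embs = np.eye(3)
toy_image_emb = np.array([0.1, 0.9, 0.2])
predicted_class, class_probs = zero_shot_classify(toy_image_emb, toy_text_embs)
```

No training is involved here: the class prompts alone define the classifier, which is why this setting needs no annotated data.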

Paper Structure (30 sections, 7 figures, 14 tables)

Figures (7)

  • Figure 1: An overview of the image-based and video-based CLIP frameworks. The yellow box shows the frames sampled from a raw video for a 10-second event, i.e., 300 frames at 30 frames per second. VideoCLIP takes 30 consecutive frames as a short video clip, while Single-frameCLIP takes one frame at a time. For the very first sample, VideoCLIP builds its clip by repeating Image-1 sixteen times and appending Images 2 to 15 (30 frames in total, shown in the green box), whereas Single-frameCLIP uses only Image-1 as its first input and makes a prediction. This process continues for each subsequent frame. The prediction for an entire event is the majority vote over all individual predictions: Multi-frameCLIP applies majority voting to Single-frameCLIP's per-frame predictions, and VideoCLIP applies majority voting to its own per-clip outputs.
  • Figure 2: The proposed multimodal Single-frameCLIP framework exploits the pretrained CLIP architecture to extract vision-text semantic information.
  • Figure 3: The mean and variance of Top-1 accuracy of the Single-frameCLIP and CNN baselines on the StateFarm dataset.
  • Figure 4: Performance comparison of the Single-frameCLIP model and traditional CNN models trained on varying proportions of the StateFarm dataset. It shows average accuracies for 8-fold cross-validation.
  • Figure 5: Confusion matrix of the Single-frameCLIP model on the DMD dataset. The "reaching to backseat", "reaching to side", and "driving safely" classes were the most challenging for the Single-frameCLIP model. The dataset also has few "yawning" samples compared with the other classes. Additionally, on SAM-DD, only two of the ten classes, "head dropping" and "touching hair", had F1-scores below 0.70.
  • ...and 2 more figures
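
The clip construction and event-level voting described in the Figure 1 caption can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `clip_indices` and `majority_vote` are hypothetical, frames are 0-indexed (so Image-1 is index 0), and clamping at the end of the event is an assumption, since the caption only describes the padding for the first sample.

```python
from collections import Counter

def clip_indices(center, num_frames, clip_len=30):
    """Frame indices for the clip associated with frame `center`.

    The window spans offsets -15..+14 around `center`, and indices that
    fall outside the event are clamped to the nearest valid frame. For
    center=0 this repeats the first frame 16 times and then adds frames
    1..14, matching the first-sample description in the caption.
    Clamping at the end of the event is an assumption.
    """
    half = clip_len // 2
    return [min(max(center + offset, 0), num_frames - 1)
            for offset in range(-half, clip_len - half)]

def majority_vote(predictions):
    """Most frequent label; ties go to the label seen first."""
    return Counter(predictions).most_common(1)[0][0]

# A 10-second event at 30 frames per second has 300 frames.
first_clip = clip_indices(center=0, num_frames=300)

# Hypothetical per-frame (or per-clip) labels for one event.
event_label = majority_vote(["texting", "texting", "driving safely"])
```

The same `majority_vote` step applies to both variants: Multi-frameCLIP would feed it per-frame predictions, and VideoCLIP would feed it per-clip predictions.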