Driver Activity Classification Using Generalizable Representations from Vision-Language Models

Ross Greer; Mathias Viborg Andersen; Andreas Møgelmose; Mohan Trivedi

TL;DR

This work addresses driver activity classification by leveraging generalizable vision-language representations through a Semantic Representation Late Fusion Network (SRLF-Net) that fuses embeddings from multiple camera perspectives. By using a CLIP-based encoder and order-based augmentation, the method emphasizes semantic information over driver-specific visual traits, improving cross-driver generalization. Evaluated on the AI City Challenge Naturalistic Driving Action Recognition dataset, the approach achieves a 7-fold average accuracy of 71.64% (std 2.88); a post-processing mode filter raises accuracy to 77.10% on the best-performing fold, and the model maintains competitive discrimination when the normal driving class is removed. The findings highlight the promise of vision-language representations for robust, interpretable driver monitoring applicable to ADAS and autonomous control transitions, with future work targeting temporal modeling and open-set expansion.
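The mode filter mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function name `mode_filter`, the centered window, the window length of 5, and the tie-breaking rule are all assumptions made for illustration, and the sample predictions are invented.

```python
from collections import Counter

def mode_filter(labels, window=5):
    """Replace each label with the most common label in a centered window.

    Hypothetical helper: the window length and the tie rule (keep the
    original label when it is among the most frequent) are assumptions.
    """
    half = window // 2
    smoothed = []
    for i in range(len(labels)):
        # The window is truncated at the sequence boundaries.
        neighborhood = labels[max(0, i - half): i + half + 1]
        counts = Counter(neighborhood)
        best = max(counts.values())
        if counts[labels[i]] == best:
            smoothed.append(labels[i])  # original label wins ties
        else:
            smoothed.append(counts.most_common(1)[0][0])
    return smoothed

# Invented per-frame class predictions with isolated outliers.
frame_predictions = [0, 0, 3, 0, 0, 5, 5, 0, 5, 5]
print(mode_filter(frame_predictions, window=5))
```

The intent is that isolated single-frame misclassifications are overwritten by their temporal neighbors, which is one plausible reason a filter of this kind would raise frame-level accuracy.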

Abstract

Driver activity classification is crucial for ensuring road safety, with applications ranging from driver assistance systems to autonomous vehicle control transitions. In this paper, we present a novel approach leveraging generalizable representations from vision-language models for driver activity classification. Our method employs a Semantic Representation Late Fusion Neural Network (SRLF-Net) to process synchronized video frames from multiple perspectives. Each frame is encoded using a pretrained vision-language encoder, and the resulting embeddings are fused to generate class probability predictions. By leveraging contrastively-learned vision-language representations, our approach achieves robust performance across diverse driver activities. We evaluate our method on the Naturalistic Driving Action Recognition Dataset, demonstrating strong accuracy across many classes. Our results suggest that vision-language representations offer a promising avenue for driver monitoring systems, providing both accuracy and interpretability through natural language descriptors.
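As a rough illustration of the late-fusion idea, the following NumPy sketch runs one forward pass with randomly initialized weights. It is a simplification under stated assumptions, not the authors' code: random vectors stand in for the CLIP embeddings, batch normalization and dropout are omitted, the layer widths are read from the Figure 1 caption (with the first listed fusion size taken to be the 768-dimensional concatenated input), and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dims):
    """Random (weight, bias) pairs for a chain of fully-connected layers."""
    return [(rng.normal(0.0, 0.02, size=(d_in, d_out)), np.zeros(d_out))
            for d_in, d_out in zip(dims[:-1], dims[1:])]

def run_mlp(x, layers, final_activation=True):
    """Linear -> ReLU per layer (batch norm and dropout omitted in this sketch)."""
    for i, (w, b) in enumerate(layers):
        x = x @ w + b
        if final_activation or i < len(layers) - 1:
            x = np.maximum(x, 0.0)
    return x

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

NUM_VIEWS, EMBED_DIM, NUM_CLASSES = 3, 768, 16

# One independent (non-shared-weight) encoder per camera view: 768 -> 512 -> 256.
view_encoders = [init_mlp([EMBED_DIM, 512, 256]) for _ in range(NUM_VIEWS)]
# Fusion head on the concatenated features (3 * 256 = 768). Assumed widths.
fusion_head = init_mlp([NUM_VIEWS * 256, 768, 512, 256, 128, NUM_CLASSES])

def srlf_forward(view_embeddings):
    """view_embeddings: list of NUM_VIEWS arrays, each of shape (batch, EMBED_DIM)."""
    encoded = [run_mlp(e, enc) for e, enc in zip(view_embeddings, view_encoders)]
    fused = np.concatenate(encoded, axis=-1)  # late fusion by concatenation
    logits = run_mlp(fused, fusion_head, final_activation=False)
    return softmax(logits)

# Random stand-ins for the per-view CLIP embeddings of a batch of 4 frames.
batch = [rng.normal(size=(4, EMBED_DIM)) for _ in range(NUM_VIEWS)]
probs = srlf_forward(batch)
print(probs.shape)
```

Each row of `probs` is a probability distribution over the 16 classes for one set of synchronized frames. Keeping the per-view encoders separate lets each camera perspective learn its own projection before the views are combined.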

Paper Structure (13 sections, 5 figures, 2 tables, 1 algorithm)

This paper contains 13 sections, 5 figures, 2 tables, 1 algorithm.

Introduction
Related Research
Methodology
Algorithm
Semantic Representation Late Fusion Neural Network
Leveraging Generalizable Representations from Language-Vision Foundation Models
Separating Visual and Semantic Information using Order-based Augmentation
Experimental Evaluation
Dataset
Training Details
Evaluation Over All Classes
Distracting Activities Only: Evaluating Without Normal Driving Class
Concluding Remarks and Future Research

Figures (5)

Figure 1: The Semantic Representation Late Fusion Network (SRLF-Net) takes images from multiple perspectives as input. Each image is sent to a CLIP encoder. Our experiments use the base-size Vision Transformer backbone with patch size 32. These representations are then further encoded using independent (non-shared-weight) fully-connected layers, each followed by batch normalization, ReLU activation, and dropout (rates 0.5 and 0.6 respectively). We use input size 768 and two layers, compressing first to 512 and then to 256. These representations are then concatenated and used as input to another series of fully-connected layers (fusion step), again using batch normalization and ReLU activation between layers. The sizes of these layers are 768, 768, 512, 256, 128, then $n$ (number of classes), which is 16 for our experiments.
Figure 2: Illustration of multi-perspective in-cabin camera views for monitoring driver behavior under the class '0: Normal Forward Driving'. (1) Dashboard view. (2) Rear-view. (3) Side view.
Figure 3: Confusion matrix for the best-performing fold (fold 6) with a mode filter applied, resulting in an accuracy of 77.10%.
Figure 4: Binary confusion matrix for the best-performing fold (fold 6), with class 0 (straight forward driving) against a combination of all other activity classes, achieving 77.22% accuracy.
Figure 5: Confusion matrix for the best-performing fold (fold 6) without class 0 (straight forward driving) and with a mode filter applied, achieving 70.06% accuracy. Removing the forward driving class decreases the overall accuracy slightly (simply because the over-predicted forward driving class accounted for a majority of the dataset), but the average per-class accuracy actually increases from 60.44% to 70.13%. The alignment of average per-class accuracy and overall accuracy is a strong indicator of the model's effective learning.
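The distinction between overall accuracy and average per-class accuracy drawn in the Figure 5 caption can be made concrete with a toy confusion matrix. The numbers below are invented for illustration and are not taken from the paper; the function names are hypothetical. Rows are assumed to be true classes and columns predicted classes.

```python
def overall_accuracy(cm):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

def mean_per_class_accuracy(cm):
    """Average of per-class recall, so every class counts equally."""
    recalls = [cm[i][i] / sum(row) for i, row in enumerate(cm) if sum(row) > 0]
    return sum(recalls) / len(recalls)

# Invented example: class 0 dominates the data and is over-predicted.
toy_cm = [
    [90, 5, 5],  # true class 0
    [6, 4, 0],   # true class 1
    [5, 0, 5],   # true class 2
]

print(overall_accuracy(toy_cm))
print(mean_per_class_accuracy(toy_cm))
```

In this toy case the dominant class pulls the overall accuracy well above the per-class average, because the two rare classes are frequently misread as class 0. When the two metrics are close, as the caption reports after class 0 is removed, no single class is inflating the headline number.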
