Towards Infusing Auxiliary Knowledge for Distracted Driver Detection
Ishwar B Balappanawar, Ashmit Chamoli, Ruwan Wickramarachchi, Aditya Mishra, Ponnurangam Kumaraguru, Amit P. Sheth
TL;DR
This work tackles distracted driver detection (DDD) from in-vehicle video by introducing KiD3, a knowledge-infused framework that fuses semantic scene graphs and driver pose with visual features in a unified pipeline. KiD3 combines a VGG-16 image encoder, RelTR-based scene graph generation, a Graph Convolutional Network, OpenPose-based pose estimation, and hand-crafted pose features to create a holistic representation for frame-level action classification into 18 activities. In experiments on SynDDv1-like data, KiD3 achieves a peak accuracy of 90.5%, a relative improvement of 13.64% over a vision-only baseline, which highlights the value of auxiliary knowledge for robust DDD without resorting to large, high-parameter models. The results support incorporating explicit semantic and pose information to enhance safety-critical driver monitoring systems; future work will explore temporal graphs and Vision-Language Models.
Abstract
Distracted driving is a leading cause of road accidents globally. Identifying distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging because it requires robust models that can generalize to a diverse set of driver behaviors without extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) that infuses auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates scene graphs and driver pose information with the visual cues in video frames to create a holistic representation of the driver's actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.
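To make the fusion idea concrete, the following is a minimal sketch, not the KiD3 implementation. It assumes that the three information sources have already been reduced to vectors: a stand-in image embedding (in place of VGG-16 features), toy scene-graph node features (in place of RelTR output), and a stand-in vector of pose features. All dimensions, the random weights, the row-normalized graph-convolution variant, mean pooling, and fusion by simple concatenation are illustrative assumptions; only the 18-class output size comes from the text.

```python
import random

NUM_CLASSES = 18  # number of activity classes, as stated in the TL;DR


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1 (A + I) X W), row-normalized (assumed variant)."""
    n, d_in, d_out = len(adj), len(weight), len(weight[0])
    out = []
    for i in range(n):
        # Neighbor weights for node i, with a self-loop added.
        row = [adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        deg = sum(row)
        # Degree-normalized average of neighbor features.
        agg = [sum(row[j] * feats[j][k] for j in range(n)) / deg for k in range(d_in)]
        # Linear projection followed by ReLU.
        out.append([max(0.0, sum(agg[k] * weight[k][m] for k in range(d_in)))
                    for m in range(d_out)])
    return out


def mean_pool(node_feats):
    """Collapse per-node features into one graph-level vector."""
    return [sum(col) / len(node_feats) for col in zip(*node_feats)]


def classify(image_feat, graph_feat, pose_feat, weight, bias):
    """Concatenate the three feature vectors and apply a linear classifier."""
    fused = list(image_feat) + list(graph_feat) + list(pose_feat)
    logits = [sum(f * w for f, w in zip(fused, row)) + b
              for row, b in zip(weight, bias)]
    pred = max(range(len(logits)), key=lambda c: logits[c])
    return pred, logits


rng = random.Random(0)  # fixed seed so the toy run is repeatable


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical scene graph: driver (0) linked to phone (1) and steering wheel (2).
adj = [[0, 1, 1],
       [1, 0, 0],
       [1, 0, 0]]
node_feats = rand_matrix(3, 4)                      # toy node embeddings
graph_feat = mean_pool(gcn_layer(adj, node_feats, rand_matrix(4, 3)))
image_feat = rand_matrix(1, 8)[0]                   # stand-in for an image embedding
pose_feat = rand_matrix(1, 5)[0]                    # stand-in for hand-crafted pose features

fused_dim = len(image_feat) + len(graph_feat) + len(pose_feat)
pred, logits = classify(image_feat, graph_feat, pose_feat,
                        rand_matrix(NUM_CLASSES, fused_dim), [0.0] * NUM_CLASSES)
print("predicted class index:", pred)
```

With untrained random weights the predicted index is meaningless; the sketch only shows how a graph-level vector can sit alongside visual and pose vectors in a single frame-level classifier input.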
