Towards Infusing Auxiliary Knowledge for Distracted Driver Detection

Ishwar B Balappanawar, Ashmit Chamoli, Ruwan Wickramarachchi, Aditya Mishra, Ponnurangam Kumaraguru, Amit P. Sheth

TL;DR

This work tackles distracted driver detection from in-vehicle video by introducing KiD3, a knowledge-infused framework that fuses semantic scene graphs and driver pose with visual features in a unified pipeline. KiD3 leverages a VGG-16 image encoder, RelTR-based scene graph generation, a Graph Convolutional Network, OpenPose-based pose estimation, and hand-crafted pose features to create a holistic representation for frame-level action classification into $18$ activities. In experiments on SynDDv1-like data, KiD3 achieves a peak accuracy of $90.5\%$, representing a relative improvement of $13.64\%$ over a vision-only baseline, and highlighting the value of auxiliary knowledge for robust DDD without resorting to large high-parameter models. The results support incorporating explicit semantic and pose information to enhance safety-critical driver monitoring systems, with future work exploring temporal graphs and Vision-Language Models.
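The two headline numbers above can be related with a little arithmetic. The sketch below assumes the $13.64\%$ figure is a relative gain, as the TL;DR states, and back-computes the vision-only accuracy that this would imply. The function names are illustrative, and the implied baseline is a derived value, not one reported in this summary.

```python
def relative_improvement(baseline_acc: float, new_acc: float) -> float:
    """Relative gain of new_acc over baseline_acc, in percent."""
    return (new_acc - baseline_acc) / baseline_acc * 100.0


def implied_baseline(new_acc: float, rel_improvement_pct: float) -> float:
    """Baseline accuracy implied by a final accuracy and a relative gain (percent)."""
    return new_acc / (1.0 + rel_improvement_pct / 100.0)


peak_acc = 90.5   # peak accuracy from the TL;DR, in percent
rel_gain = 13.64  # relative improvement from the TL;DR, in percent

# Derived under the relative-gain assumption; not a reported result.
baseline_acc = implied_baseline(peak_acc, rel_gain)

print(f"implied vision-only accuracy: {baseline_acc:.2f}%")
print(f"absolute gain: {peak_acc - baseline_acc:.2f} percentage points")
print(f"round-trip relative gain: {relative_improvement(baseline_acc, peak_acc):.2f}%")
```

Under this reading, the absolute gain in percentage points is smaller than $13.64$, which is why it matters whether an improvement is quoted in relative or absolute terms.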

Abstract

Distracted driving is a leading cause of road accidents globally. Identification of distracted driving involves reliably detecting and classifying various forms of driver distraction (e.g., texting, eating, or using in-car devices) from in-vehicle camera feeds to enhance road safety. This task is challenging due to the need for robust models that can generalize to a diverse set of driver behaviors without requiring extensive annotated datasets. In this paper, we propose KiD3, a novel method for distracted driver detection (DDD) by infusing auxiliary knowledge about semantic relations between entities in a scene and the structural configuration of the driver's pose. Specifically, we construct a unified framework that integrates scene graphs and driver pose information with the visual cues in video frames to create a holistic representation of the driver's actions. Our results indicate that KiD3 achieves a 13.64% accuracy improvement over the vision-only baseline by incorporating such auxiliary knowledge with visual information.
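To make the idea of a unified representation concrete, here is a minimal pure-Python sketch of one way the three signals could be combined for a single frame. It is not the authors' implementation. Fusion by concatenation, mean pooling over graph nodes, an undirected adjacency matrix, all feature dimensions, and the random weights are assumptions made for illustration. In the described pipeline the inputs would come from VGG-16, RelTR, and OpenPose, while here they are random stand-ins.

```python
import math
import random

NUM_CLASSES = 18  # number of activity classes given in the TL;DR


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so every node keeps its own features (degree is at least 1).
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    hidden = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


def mean_pool(node_feats):
    """Average node embeddings into one graph-level vector (assumed readout)."""
    return [sum(col) / len(node_feats) for col in zip(*node_feats)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(visual_feat, adj, node_feats, pose_feat, gcn_w, clf_w):
    """Fuse the three feature groups and predict an activity class index."""
    graph_emb = mean_pool(gcn_layer(adj, node_feats, gcn_w))
    fused = visual_feat + graph_emb + pose_feat  # concatenation (assumed fusion)
    probs = softmax(matmul([fused], clf_w)[0])
    pred = max(range(NUM_CLASSES), key=probs.__getitem__)
    return fused, probs, pred


rng = random.Random(0)


def rand_matrix(rows, cols):
    """Random stand-in for learned weights or extracted features."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Hypothetical sizes: 8-d visual feature, 3 scene-graph nodes with 4-d features,
# 6-d graph embedding, 5 hand-crafted pose features.
visual_feat = [rng.random() for _ in range(8)]
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]  # e.g. hand - phone - ear
node_feats = rand_matrix(3, 4)
pose_feat = [rng.random() for _ in range(5)]
gcn_w = rand_matrix(4, 6)
clf_w = rand_matrix(8 + 6 + 5, NUM_CLASSES)

fused, probs, pred = classify(visual_feat, adj, node_feats, pose_feat, gcn_w, clf_w)
print(len(fused), pred)
```

Concatenation is used here only because it is the simplest fusion that keeps each modality's features intact. With untrained random weights, the predicted class carries no meaning.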

Paper Structure (28 sections, 5 figures, 2 tables)

Figures (5)

  • Figure 1: This figure illustrates the process of extracting detailed information from a scene to analyze driver behavior. The extreme left panel shows an image of a driver sampled from the video. The middle left panel presents the corresponding estimated pose, highlighting how structured representations can be derived from raw image data. The middle right panel presents the object information obtained via object detection. The extreme right panel provides a sample relation from the scene graph, capturing the relationships between different objects and actions.
  • Figure 2: Camera mounting setup for the three views in the SynDD1 dataset: 1. Dashboard, 2. Behind rear view mirror, and 3. Top right side window.
  • Figure 3: Workflow of our proposed method. The figure illustrates the integration of an Image Encoder, Scene Graph Generator, GCN Graph Encoder, and Pose Estimators within our pipeline.
  • Figure 4: F1 scores and support for individual activity (i.e., Class 1 - 18) prediction across three methods, with Method 2 (i.e., Vision + SGG) and Method 3 (i.e., Vision + SGG + Pose Info) showing improvements over Method 1 (i.e., Vision only).
  • Figure :